

Intel® 64 and IA-32 Architectures Software Developer’s Manual

Volume 2C: Instruction Set Reference, V-Z


NOTE: The Intel® 64 and IA-32 Architectures Software Developer's Manual consists of ten volumes: Basic Architecture, Order Number 253665; Instruction Set Reference A-L, Order Number 253666; Instruction Set Reference M-U, Order Number 253667; Instruction Set Reference V-Z, Order Number 326018; Instruction Set Reference, Order Number 334569; System Programming Guide, Part 1, Order Number 253668; System Programming Guide, Part 2, Order Number 253669; System Programming Guide, Part 3, Order Number 326019; System Programming Guide, Part 4, Order Number 332831; Model-Specific Registers, Order Number 335592. Refer to all ten volumes when evaluating your design needs.


Order Number: 326018-067US

May 2018



Intel technologies features and benefits depend on system configuration and may require enabled hardware, software, or service activation. Learn more at intel.com, or from the OEM or retailer.

No computer system can be absolutely secure. Intel does not assume any liability for lost or stolen data or systems or any damages resulting from such losses.

You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting http://www.intel.com/design/literature.htm.

Intel, the Intel logo, Intel Atom, Intel Core, Intel SpeedStep, MMX, Pentium, VTune, and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others. Copyright © 1997-2018, Intel Corporation. All Rights Reserved.

CHAPTER 5 INSTRUCTION SET REFERENCE, V-Z



    1. TERNARY BIT VECTOR LOGIC TABLE

      VPTERNLOGD/VPTERNLOGQ instructions operate on dword/qword elements and take three bit vectors of the respective input data elements to form a set of 32/64 indices, where each 3-bit value provides an index into an 8-bit lookup table represented by the imm8 byte of the instruction. The 256 possible values of the imm8 byte are organized as a 16x16 boolean logic table. The 16 rows of the table use the lower 4 bits of imm8 as the row index. The 16 columns are referenced by imm8[7:4]. The 16 columns of the table are presented in two halves, with the 8 columns for column index values 0:7 shown in Table 5-1, followed by Table 5-2 showing the 8 columns corresponding to column index values 8:15. This section presents the two halves of the 256-entry table using a shorthand notation representing simple or compound boolean logic expressions with three input bit source data.

      The three input bit source data will be denoted with the capital letters A, B, and C, where A represents a bit from the first source operand (also the destination operand), and B and C represent a bit from the second and third source operands, respectively.

      Each map entry takes the form of a logic expression consisting of one or more component expressions. Each component expression consists of either a unary or binary boolean operator and associated operands. Each binary boolean operator is expressed in lowercase letters, with operands concatenated after the logic operator. The unary operator ‘not’ is expressed using ‘!’. Additionally, the conditional expression “A?B:C” expresses a result returning B if A is set, returning C otherwise.

      A binary boolean operator is followed by two operands, e.g. andAB. For a compound binary expression that contains commutative components and comprises the same logic operator, the 2nd logic operator is omitted and three operands can be concatenated in sequence, e.g. andABC. When the 2nd operand of the first binary boolean expression comes from the result of another boolean expression, the 2nd boolean expression is concatenated after the uppercase operand of the first logic expression, e.g. norBnandAC. When the result is independent of an operand, that operand is omitted in the logic expression, e.g. zeros or norCB.

      The 3-input expression “majorABC” returns 0 if two or more input bits are 0, and returns 1 if two or more input bits are 1. The 3-input expression “minorABC” returns 1 if two or more input bits are 0, and returns 0 if two or more input bits are 1.

      The building-block bit logic functions used in Table 5-1 and Table 5-2 include:

      • Constants: TRUE (1), FALSE (0);

      • Unary function: Not (!);

      • Binary functions: and, nand, or, nor, xor, xnor;

      • Conditional function: Select (?:);

      • Tertiary functions: major, minor.


        Table 5-1. Low 8 columns of the 16x16 Map of VPTERNLOG Boolean Logic Operations

        Imm[3:0] \ Imm[7:4] | 0H | 1H | 2H | 3H | 4H | 5H | 6H | 7H
        00H | FALSE | andAnorBC | norBnandAC | andA!B | norCnandBA | andA!C | andAxorBC | andAnandBC
        01H | norABC | norCB | norBxorAC | A?!B:norBC | norCxorBA | A?!C:norBC | A?xorBC:norBC | A?nandBC:norBC
        02H | andCnorBA | norBxnorAC | andC!B | norBnorAC | C?norBA:andBA | C?norBA:A | C?!B:andBA | C?!B:A
        03H | norBA | norBandAC | C?!B:norBA | !B | C?norBA:xnorBA | A?!C:!B | A?xorBC:!B | A?nandBC:!B
        04H | andBnorAC | norCxnorBA | B?norAC:andAC | B?norAC:A | andB!C | norCnorBA | B?!C:andAC | B?!C:A
        05H | norCA | norCandBA | B?norAC:xnorAC | A?!B:!C | B?!C:norAC | !C | A?xorBC:!C | A?nandBC:!C
        06H | norAxnorBC | A?norBC:xorBC | B?norAC:C | xorBorAC | C?norBA:B | xorCorBA | xorCB | B?!C:orAC
        07H | norAandBC | minorABC | C?!B:!A | nandBorAC | B?!C:!A | nandCorBA | A?xorBC:nandBC | nandCB
        08H | norAnandBC | A?norBC:andBC | andCxorBA | A?!B:andBC | andBxorAC | A?!C:andBC | A?xorBC:andBC | xorAandBC
        09H | norAxorBC | A?norBC:xnorBC | C?xorBA:norBA | A?!B:xnorBC | B?xorAC:norAC | A?!C:xnorBC | xnorABC | A?nandBC:xnorBC
        0AH | andC!A | A?norBC:C | andCnandBA | A?!B:C | C?!A:andBA | xorCA | xorCandBA | A?nandBC:C
        0BH | C?!A:norBA | C?!A:!B | C?nandBA:norBA | C?nandBA:!B | B?xorAC:!A | B?xorAC:nandAC | C?nandBA:xnorBA | nandBxnorAC
        0CH | andB!A | A?norBC:B | B?!A:andAC | xorBA | andBnandAC | A?!C:B | xorBandAC | A?nandBC:B
        0DH | B?!A:norAC | B?!A:!C | B?!A:xnorAC | C?xorBA:nandBA | B?nandAC:norAC | B?nandAC:!C | B?nandAC:xnorAC | nandCxnorBA
        0EH | norAnorBC | xorAorBC | B?!A:C | A?!B:orBC | C?!A:B | A?!C:orBC | B?nandAC:C | A?nandBC:orBC
        0FH | !A | nandAorBC | C?nandBA:!A | nandBA | B?nandAC:!A | nandCA | nandAxnorBC | nandABC


        Table 5-2 shows the half of the 256-entry map corresponding to column index values 8:15.


        Table 5-2. Upper 8 columns of the 16x16 Map of VPTERNLOG Boolean Logic Operations

        Imm[3:0] \ Imm[7:4] | 08H | 09H | 0AH | 0BH | 0CH | 0DH | 0EH | 0FH
        00H | andABC | andAxnorBC | andCA | B?andAC:A | andBA | C?andBA:A | andAorBC | A
        01H | A?andBC:norBC | B?andAC:!C | A?C:norBC | C?A:!B | A?B:norBC | B?A:!C | xnorAorBC | orAnorBC
        02H | andCxnorBA | B?andAC:xorAC | B?andAC:C | B?andAC:orAC | C?xnorBA:andBA | B?A:xorAC | B?A:C | B?A:orAC
        03H | A?andBC:!B | xnorBandAC | A?C:!B | nandBnandAC | xnorBA | B?A:nandAC | A?orBC:!B | orA!B
        04H | andBxnorAC | C?andBA:xorBA | B?xnorAC:andAC | B?xnorAC:A | C?andBA:B | C?andBA:orBA | C?A:B | C?A:orBA
        05H | A?andBC:!C | xnorCandBA | xnorCA | C?A:nandBA | A?B:!C | nandCnandBA | A?orBC:!C | orA!C
        06H | A?andBC:xorBC | xorABC | A?C:xorBC | B?xnorAC:orAC | A?B:xorBC | C?xnorBA:orBA | A?orBC:xorBC | orAxorBC
        07H | xnorAandBC | A?xnorBC:nandBC | A?C:nandBC | nandBxorAC | A?B:nandBC | nandCxorBA | A?orBC:nandBC | orAnandBC
        08H | andCB | A?xnorBC:andBC | andCorAB | B?C:A | andBorAC | C?B:A | majorABC | orAandBC
        09H | B?C:norAC | xnorCB | xnorCorBA | C?orBA:!B | xnorBorAC | B?orAC:!C | A?orBC:xnorBC | orAxnorBC
        0AH | A?andBC:C | A?xnorBC:C | C | B?C:orAC | A?B:C | B?orAC:xorAC | orCandBA | orCA
        0BH | B?C:!A | B?C:nandAC | orCnorBA | orC!B | B?orAC:!A | B?orAC:nandAC | orCxnorBA | nandBnorAC
        0CH | A?andBC:B | A?xnorBC:B | A?C:B | C?orBA:xorBA | B | C?B:orBA | orBandAC | orBA
        0DH | C?B:!A | C?B:nandBA | C?orBA:!A | C?orBA:nandBA | orBnorAC | orB!C | orBxnorAC | nandCnorBA
        0EH | A?andBC:orBC | A?xnorBC:orBC | A?C:orBC | orCxorBA | A?B:orBC | orBxorAC | orCB | orABC
        0FH | nandAnandBC | nandAxorBC | orC!A | orCnandBA | orB!A | orBnandAC | nandAnorBC | TRUE


        Table 5-1 and Table 5-2 translate each of the possible values of the imm8 byte to a Boolean expression. These tables can also be used by software to translate Boolean expressions to numerical constants to form the imm8 value needed to construct the VPTERNLOG syntax. There is a unique set of three byte constants (F0H, CCH, AAH) that can be used for this purpose as input operands in conjunction with the Boolean expressions defined in those tables. The reverse mapping can be expressed as:

        Result_imm8 = Table_Lookup_Entry( 0F0H, 0CCH, 0AAH)

        Table_Lookup_Entry is the Boolean expression defined in Table 5-1 and Table 5-2.


    2. INSTRUCTIONS (V-Z)

Chapter 5 continues an alphabetical discussion of Intel® 64 and IA-32 instructions (V-Z). See also: Chapter 3, “Instruction Set Reference, A-L,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A, and Chapter 4, “Instruction Set Reference, M-U‚” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2B.


VALIGND/VALIGNQ—Align Doubleword/Quadword Vectors

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.NDS.128.66.0F3A.W0 03 /r ib VALIGND xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst, imm8 | A | V/V | AVX512VL AVX512F | Shift right and merge vectors xmm2 and xmm3/m128/m32bcst with double-word granularity using imm8 as number of elements to shift, and store the final result in xmm1, under writemask.
EVEX.NDS.128.66.0F3A.W1 03 /r ib VALIGNQ xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst, imm8 | A | V/V | AVX512VL AVX512F | Shift right and merge vectors xmm2 and xmm3/m128/m64bcst with quad-word granularity using imm8 as number of elements to shift, and store the final result in xmm1, under writemask.
EVEX.NDS.256.66.0F3A.W0 03 /r ib VALIGND ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst, imm8 | A | V/V | AVX512VL AVX512F | Shift right and merge vectors ymm2 and ymm3/m256/m32bcst with double-word granularity using imm8 as number of elements to shift, and store the final result in ymm1, under writemask.
EVEX.NDS.256.66.0F3A.W1 03 /r ib VALIGNQ ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst, imm8 | A | V/V | AVX512VL AVX512F | Shift right and merge vectors ymm2 and ymm3/m256/m64bcst with quad-word granularity using imm8 as number of elements to shift, and store the final result in ymm1, under writemask.
EVEX.NDS.512.66.0F3A.W0 03 /r ib VALIGND zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst, imm8 | A | V/V | AVX512F | Shift right and merge vectors zmm2 and zmm3/m512/m32bcst with double-word granularity using imm8 as number of elements to shift, and store the final result in zmm1, under writemask.
EVEX.NDS.512.66.0F3A.W1 03 /r ib VALIGNQ zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst, imm8 | A | V/V | AVX512F | Shift right and merge vectors zmm2 and zmm3/m512/m64bcst with quad-word granularity using imm8 as number of elements to shift, and store the final result in zmm1, under writemask.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | Full | ModRM:reg (w) | EVEX.vvvv | ModRM:r/m (r) | NA

Description

Concatenates and shifts right doubleword/quadword elements of the first source operand (the second operand) and the second source operand (the third operand) into a 1024/512/256-bit intermediate vector. The low 512/256/128 bits of the intermediate vector are written to the destination operand (the first operand) using the writemask k1. The destination and first source operands are ZMM/YMM/XMM registers. The second source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a 32/64-bit memory location.

This instruction is writemasked, so only those elements with the corresponding bit set in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with the corresponding bit clear in k1 retain their previous values (merging-masking) or are set to 0 (zeroing-masking).



Operation

VALIGND (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)

IF (SRC2 *is memory*) AND (EVEX.b = 1)
    THEN
        FOR j ← 0 TO KL-1
            i ← j * 32
            src[i+31:i] ← SRC2[31:0]
        ENDFOR;
    ELSE src ← SRC2
FI

; Concatenate sources
tmp[VL-1:0] ← src[VL-1:0]
tmp[2VL-1:VL] ← SRC1[VL-1:0]

; Shift right doubleword elements
IF VL = 128
    THEN SHIFT = imm8[1:0]
    ELSE
        IF VL = 256
            THEN SHIFT = imm8[2:0]
            ELSE SHIFT = imm8[3:0]
        FI
FI;
tmp[2VL-1:0] ← tmp[2VL-1:0] >> (32*SHIFT)

; Apply writemask
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← tmp[i+31:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR;
DEST[MAXVL-1:VL] ← 0



VALIGNQ (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)

IF (SRC2 *is memory*) AND (EVEX.b = 1)
    THEN
        FOR j ← 0 TO KL-1
            i ← j * 64
            src[i+63:i] ← SRC2[63:0]
        ENDFOR;
    ELSE src ← SRC2
FI

; Concatenate sources
tmp[VL-1:0] ← src[VL-1:0]
tmp[2VL-1:VL] ← SRC1[VL-1:0]

; Shift right quadword elements
IF VL = 128
    THEN SHIFT = imm8[0]
    ELSE
        IF VL = 256
            THEN SHIFT = imm8[1:0]
            ELSE SHIFT = imm8[2:0]
        FI
FI;
tmp[2VL-1:0] ← tmp[2VL-1:0] >> (64*SHIFT)

; Apply writemask
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← tmp[i+63:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR;
DEST[MAXVL-1:VL] ← 0



Intel C/C++ Compiler Intrinsic Equivalent

VALIGND __m512i _mm512_alignr_epi32( __m512i a, __m512i b, int cnt);
VALIGND __m512i _mm512_mask_alignr_epi32( __m512i s, __mmask16 k, __m512i a, __m512i b, int cnt);
VALIGND __m512i _mm512_maskz_alignr_epi32( __mmask16 k, __m512i a, __m512i b, int cnt);
VALIGND __m256i _mm256_mask_alignr_epi32( __m256i s, __mmask8 k, __m256i a, __m256i b, int cnt);
VALIGND __m256i _mm256_maskz_alignr_epi32( __mmask8 k, __m256i a, __m256i b, int cnt);
VALIGND __m128i _mm_mask_alignr_epi32( __m128i s, __mmask8 k, __m128i a, __m128i b, int cnt);
VALIGND __m128i _mm_maskz_alignr_epi32( __mmask8 k, __m128i a, __m128i b, int cnt);
VALIGNQ __m512i _mm512_alignr_epi64( __m512i a, __m512i b, int cnt);
VALIGNQ __m512i _mm512_mask_alignr_epi64( __m512i s, __mmask8 k, __m512i a, __m512i b, int cnt);
VALIGNQ __m512i _mm512_maskz_alignr_epi64( __mmask8 k, __m512i a, __m512i b, int cnt);
VALIGNQ __m256i _mm256_mask_alignr_epi64( __m256i s, __mmask8 k, __m256i a, __m256i b, int cnt);
VALIGNQ __m256i _mm256_maskz_alignr_epi64( __mmask8 k, __m256i a, __m256i b, int cnt);
VALIGNQ __m128i _mm_mask_alignr_epi64( __m128i s, __mmask8 k, __m128i a, __m128i b, int cnt);
VALIGNQ __m128i _mm_maskz_alignr_epi64( __mmask8 k, __m128i a, __m128i b, int cnt);


Exceptions

See Exceptions Type E4NF.


VBLENDMPD/VBLENDMPS—Blend Float64/Float32 Vectors Using an OpMask Control

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.NDS.128.66.0F38.W1 65 /r VBLENDMPD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | A | V/V | AVX512VL AVX512F | Blend double-precision vector xmm2 and double-precision vector xmm3/m128/m64bcst and store the result in xmm1, under control mask.
EVEX.NDS.256.66.0F38.W1 65 /r VBLENDMPD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | A | V/V | AVX512VL AVX512F | Blend double-precision vector ymm2 and double-precision vector ymm3/m256/m64bcst and store the result in ymm1, under control mask.
EVEX.NDS.512.66.0F38.W1 65 /r VBLENDMPD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst | A | V/V | AVX512F | Blend double-precision vector zmm2 and double-precision vector zmm3/m512/m64bcst and store the result in zmm1, under control mask.
EVEX.NDS.128.66.0F38.W0 65 /r VBLENDMPS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | A | V/V | AVX512VL AVX512F | Blend single-precision vector xmm2 and single-precision vector xmm3/m128/m32bcst and store the result in xmm1, under control mask.
EVEX.NDS.256.66.0F38.W0 65 /r VBLENDMPS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | A | V/V | AVX512VL AVX512F | Blend single-precision vector ymm2 and single-precision vector ymm3/m256/m32bcst and store the result in ymm1, under control mask.
EVEX.NDS.512.66.0F38.W0 65 /r VBLENDMPS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst | A | V/V | AVX512F | Blend single-precision vector zmm2 and single-precision vector zmm3/m512/m32bcst using k1 as select control and store the result in zmm1.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | Full | ModRM:reg (w) | EVEX.vvvv | ModRM:r/m (r) | NA

Description

Performs an element-by-element blending between float64/float32 elements in the first source operand (the second operand) and the elements in the second source operand (the third operand), using an opmask register as select control. The blended result is written to the destination register.

The destination and first source operands are ZMM/YMM/XMM registers. The second source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a 64-bit memory location.

The opmask register is not used as a writemask for this instruction. Instead, the mask is used as an element selector: every element of the destination is conditionally selected between first source or second source using the value of the related mask bit (0 for first source operand, 1 for second source operand).

If EVEX.z is set, the elements with corresponding mask bit value of 0 in the destination operand are zeroed.



Operation

VBLENDMPD (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no controlmask*
        THEN
            IF (EVEX.b = 1) AND (SRC2 *is memory*)
                THEN DEST[i+63:i] ← SRC2[63:0]
                ELSE DEST[i+63:i] ← SRC2[i+63:i]
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN DEST[i+63:i] ← SRC1[i+63:i]
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI;
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0

VBLENDMPS (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no controlmask*
        THEN
            IF (EVEX.b = 1) AND (SRC2 *is memory*)
                THEN DEST[i+31:i] ← SRC2[31:0]
                ELSE DEST[i+31:i] ← SRC2[i+31:i]
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN DEST[i+31:i] ← SRC1[i+31:i]
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI;
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0



Intel C/C++ Compiler Intrinsic Equivalent

VBLENDMPD __m512d _mm512_mask_blend_pd( __mmask8 k, __m512d a, __m512d b);
VBLENDMPD __m256d _mm256_mask_blend_pd( __mmask8 k, __m256d a, __m256d b);
VBLENDMPD __m128d _mm_mask_blend_pd( __mmask8 k, __m128d a, __m128d b);
VBLENDMPS __m512 _mm512_mask_blend_ps( __mmask16 k, __m512 a, __m512 b);
VBLENDMPS __m256 _mm256_mask_blend_ps( __mmask8 k, __m256 a, __m256 b);
VBLENDMPS __m128 _mm_mask_blend_ps( __mmask8 k, __m128 a, __m128 b);


SIMD Floating-Point Exceptions

None


Other Exceptions

See Exceptions Type E4.


VBROADCAST—Load with Broadcast Floating-Point Data

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
VEX.128.66.0F38.W0 18 /r VBROADCASTSS xmm1, m32 | A | V/V | AVX | Broadcast single-precision floating-point element in mem to four locations in xmm1.
VEX.256.66.0F38.W0 18 /r VBROADCASTSS ymm1, m32 | A | V/V | AVX | Broadcast single-precision floating-point element in mem to eight locations in ymm1.
VEX.256.66.0F38.W0 19 /r VBROADCASTSD ymm1, m64 | A | V/V | AVX | Broadcast double-precision floating-point element in mem to four locations in ymm1.
VEX.256.66.0F38.W0 1A /r VBROADCASTF128 ymm1, m128 | A | V/V | AVX | Broadcast 128 bits of floating-point data in mem to low and high 128-bits in ymm1.
VEX.128.66.0F38.W0 18 /r VBROADCASTSS xmm1, xmm2 | A | V/V | AVX2 | Broadcast the low single-precision floating-point element in the source operand to four locations in xmm1.
VEX.256.66.0F38.W0 18 /r VBROADCASTSS ymm1, xmm2 | A | V/V | AVX2 | Broadcast low single-precision floating-point element in the source operand to eight locations in ymm1.
VEX.256.66.0F38.W0 19 /r VBROADCASTSD ymm1, xmm2 | A | V/V | AVX2 | Broadcast low double-precision floating-point element in the source operand to four locations in ymm1.
EVEX.256.66.0F38.W1 19 /r VBROADCASTSD ymm1 {k1}{z}, xmm2/m64 | B | V/V | AVX512VL AVX512F | Broadcast low double-precision floating-point element in xmm2/m64 to four locations in ymm1 using writemask k1.
EVEX.512.66.0F38.W1 19 /r VBROADCASTSD zmm1 {k1}{z}, xmm2/m64 | B | V/V | AVX512F | Broadcast low double-precision floating-point element in xmm2/m64 to eight locations in zmm1 using writemask k1.
EVEX.256.66.0F38.W0 19 /r VBROADCASTF32X2 ymm1 {k1}{z}, xmm2/m64 | C | V/V | AVX512VL AVX512DQ | Broadcast two single-precision floating-point elements in xmm2/m64 to locations in ymm1 using writemask k1.
EVEX.512.66.0F38.W0 19 /r VBROADCASTF32X2 zmm1 {k1}{z}, xmm2/m64 | C | V/V | AVX512DQ | Broadcast two single-precision floating-point elements in xmm2/m64 to locations in zmm1 using writemask k1.
EVEX.128.66.0F38.W0 18 /r VBROADCASTSS xmm1 {k1}{z}, xmm2/m32 | B | V/V | AVX512VL AVX512F | Broadcast low single-precision floating-point element in xmm2/m32 to all locations in xmm1 using writemask k1.
EVEX.256.66.0F38.W0 18 /r VBROADCASTSS ymm1 {k1}{z}, xmm2/m32 | B | V/V | AVX512VL AVX512F | Broadcast low single-precision floating-point element in xmm2/m32 to all locations in ymm1 using writemask k1.
EVEX.512.66.0F38.W0 18 /r VBROADCASTSS zmm1 {k1}{z}, xmm2/m32 | B | V/V | AVX512F | Broadcast low single-precision floating-point element in xmm2/m32 to all locations in zmm1 using writemask k1.
EVEX.256.66.0F38.W0 1A /r VBROADCASTF32X4 ymm1 {k1}{z}, m128 | D | V/V | AVX512VL AVX512F | Broadcast 128 bits of 4 single-precision floating-point data in mem to locations in ymm1 using writemask k1.
EVEX.512.66.0F38.W0 1A /r VBROADCASTF32X4 zmm1 {k1}{z}, m128 | D | V/V | AVX512F | Broadcast 128 bits of 4 single-precision floating-point data in mem to locations in zmm1 using writemask k1.
EVEX.256.66.0F38.W1 1A /r VBROADCASTF64X2 ymm1 {k1}{z}, m128 | C | V/V | AVX512VL AVX512DQ | Broadcast 128 bits of 2 double-precision floating-point data in mem to locations in ymm1 using writemask k1.
EVEX.512.66.0F38.W1 1A /r VBROADCASTF64X2 zmm1 {k1}{z}, m128 | C | V/V | AVX512DQ | Broadcast 128 bits of 2 double-precision floating-point data in mem to locations in zmm1 using writemask k1.
EVEX.512.66.0F38.W0 1B /r VBROADCASTF32X8 zmm1 {k1}{z}, m256 | E | V/V | AVX512DQ | Broadcast 256 bits of 8 single-precision floating-point data in mem to locations in zmm1 using writemask k1.
EVEX.512.66.0F38.W1 1B /r VBROADCASTF64X4 zmm1 {k1}{z}, m256 | D | V/V | AVX512F | Broadcast 256 bits of 4 double-precision floating-point data in mem to locations in zmm1 using writemask k1.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | NA | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
B | Tuple1 Scalar | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
C | Tuple2 | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
D | Tuple4 | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
E | Tuple8 | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Description

VBROADCASTSD/VBROADCASTSS/VBROADCASTF128 load floating-point values as one tuple from the source operand (second operand) in memory and broadcast to all elements of the destination operand (first operand).

VEX256-encoded versions: The destination operand is a YMM register. The source operand is either a 32-bit, 64-bit, or 128-bit memory location. Register source encodings are reserved and will #UD. Bits (MAXVL-1:256) of the destination register are zeroed.

EVEX-encoded versions: The destination operand is a ZMM/YMM/XMM register and updated according to the writemask k1. The source operand is either a 32-bit, 64-bit memory location or the low doubleword/quadword element of an XMM register.

VBROADCASTF32X2/VBROADCASTF32X4/VBROADCASTF64X2/VBROADCASTF32X8/VBROADCASTF64X4 load floating-point values as tuples from the source operand (the second operand) in memory or register and broadcast to all elements of the destination operand (the first operand). The destination operand is a YMM/ZMM register updated according to the writemask k1. The source operand is either a register or a 64-bit/128-bit/256-bit memory location.

VBROADCASTSD and VBROADCASTF128, F32x4 and F64x2 are supported only as 256-bit and 512-bit wide versions. VBROADCASTSS is supported in 128-bit, 256-bit and 512-bit wide versions. F32x8 and F64x4 are supported only as 512-bit wide versions.

VBROADCASTF32X2/VBROADCASTF32X4/VBROADCASTF32X8 have 32-bit granularity. VBROADCASTF64X2 and VBROADCASTF64X4 have 64-bit granularity.

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD.

An attempt to execute VBROADCASTSD or VBROADCASTF128 encoded with VEX.L = 0 will cause an #UD exception.


[Figure 5-1. VBROADCASTSS Operation (VEX.256 encoded version): the m32 element X0 is replicated to all eight dword positions of DEST.]

[Figure 5-2. VBROADCASTSS Operation (VEX.128-bit version): the m32 element X0 is replicated to the four low dword positions of DEST; the upper positions are zeroed.]

[Figure 5-3. VBROADCASTSD Operation (VEX.256-bit version): the m64 element X0 is replicated to all four qword positions of DEST.]

[Figure 5-4. VBROADCASTF128 Operation (VEX.256-bit version): the m128 value X0 is replicated to both 128-bit halves of DEST.]

[Figure 5-5. VBROADCASTF64X4 Operation (512-bit version with writemask all 1s): the m256 value X0 is replicated to both 256-bit halves of DEST.]



Operation

VBROADCASTSS (128 bit version VEX and legacy)
temp ← SRC[31:0]
DEST[31:0] ← temp
DEST[63:32] ← temp
DEST[95:64] ← temp
DEST[127:96] ← temp
DEST[MAXVL-1:128] ← 0

VBROADCASTSS (VEX.256 encoded version)
temp ← SRC[31:0]
DEST[31:0] ← temp
DEST[63:32] ← temp
DEST[95:64] ← temp
DEST[127:96] ← temp
DEST[159:128] ← temp
DEST[191:160] ← temp
DEST[223:192] ← temp
DEST[255:224] ← temp
DEST[MAXVL-1:256] ← 0


VBROADCASTSS (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[31:0]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0



VBROADCASTSD (VEX.256 encoded version)
temp ← SRC[63:0]
DEST[63:0] ← temp
DEST[127:64] ← temp
DEST[191:128] ← temp
DEST[255:192] ← temp
DEST[MAXVL-1:256] ← 0


VBROADCASTSD (EVEX encoded versions)
(KL, VL) = (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← SRC[63:0]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


VBROADCASTF32x2 (EVEX encoded versions)
(KL, VL) = (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    n ← (j mod 2) * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[n+31:n]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


VBROADCASTF128 (VEX.256 encoded version)
temp ← SRC[127:0]
DEST[127:0] ← temp
DEST[255:128] ← temp
DEST[MAXVL-1:256] ← 0



VBROADCASTF32X4 (EVEX encoded versions)
(KL, VL) = (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    n ← (j modulo 4) * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[n+31:n]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


VBROADCASTF64X2 (EVEX encoded versions)
(KL, VL) = (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    n ← (j modulo 2) * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← SRC[n+63:n]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR;


VBROADCASTF32X8 (EVEX.U1.512 encoded version)
FOR j ← 0 TO 15
    i ← j * 32
    n ← (j modulo 8) * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[n+31:n]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0



VBROADCASTF64X4 (EVEX.512 encoded version)
FOR j ← 0 TO 7
    i ← j * 64
    n ← (j modulo 4) * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← SRC[n+63:n]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


Intel C/C++ Compiler Intrinsic Equivalent

VBROADCASTF32x2 __m512 _mm512_broadcast_f32x2( __m128 a);
VBROADCASTF32x2 __m512 _mm512_mask_broadcast_f32x2( __m512 s, __mmask16 k, __m128 a);
VBROADCASTF32x2 __m512 _mm512_maskz_broadcast_f32x2( __mmask16 k, __m128 a);
VBROADCASTF32x2 __m256 _mm256_broadcast_f32x2( __m128 a);
VBROADCASTF32x2 __m256 _mm256_mask_broadcast_f32x2( __m256 s, __mmask8 k, __m128 a);
VBROADCASTF32x2 __m256 _mm256_maskz_broadcast_f32x2( __mmask8 k, __m128 a);
VBROADCASTF32x4 __m512 _mm512_broadcast_f32x4( __m128 a);
VBROADCASTF32x4 __m512 _mm512_mask_broadcast_f32x4( __m512 s, __mmask16 k, __m128 a);
VBROADCASTF32x4 __m512 _mm512_maskz_broadcast_f32x4( __mmask16 k, __m128 a);
VBROADCASTF32x4 __m256 _mm256_broadcast_f32x4( __m128 a);
VBROADCASTF32x4 __m256 _mm256_mask_broadcast_f32x4( __m256 s, __mmask8 k, __m128 a);
VBROADCASTF32x4 __m256 _mm256_maskz_broadcast_f32x4( __mmask8 k, __m128 a);
VBROADCASTF32x8 __m512 _mm512_broadcast_f32x8( __m256 a);
VBROADCASTF32x8 __m512 _mm512_mask_broadcast_f32x8( __m512 s, __mmask16 k, __m256 a);
VBROADCASTF32x8 __m512 _mm512_maskz_broadcast_f32x8( __mmask16 k, __m256 a);
VBROADCASTF64x2 __m512d _mm512_broadcast_f64x2( __m128d a);
VBROADCASTF64x2 __m512d _mm512_mask_broadcast_f64x2( __m512d s, __mmask8 k, __m128d a);
VBROADCASTF64x2 __m512d _mm512_maskz_broadcast_f64x2( __mmask8 k, __m128d a);
VBROADCASTF64x2 __m256d _mm256_broadcast_f64x2( __m128d a);
VBROADCASTF64x2 __m256d _mm256_mask_broadcast_f64x2( __m256d s, __mmask8 k, __m128d a);
VBROADCASTF64x2 __m256d _mm256_maskz_broadcast_f64x2( __mmask8 k, __m128d a);
VBROADCASTF64x4 __m512d _mm512_broadcast_f64x4( __m256d a);
VBROADCASTF64x4 __m512d _mm512_mask_broadcast_f64x4( __m512d s, __mmask8 k, __m256d a);
VBROADCASTF64x4 __m512d _mm512_maskz_broadcast_f64x4( __mmask8 k, __m256d a);
VBROADCASTSD __m512d _mm512_broadcastsd_pd( __m128d a);
VBROADCASTSD __m512d _mm512_mask_broadcastsd_pd( __m512d s, __mmask8 k, __m128d a);
VBROADCASTSD __m512d _mm512_maskz_broadcastsd_pd( __mmask8 k, __m128d a);
VBROADCASTSD __m256d _mm256_broadcastsd_pd( __m128d a);
VBROADCASTSD __m256d _mm256_mask_broadcastsd_pd( __m256d s, __mmask8 k, __m128d a);
VBROADCASTSD __m256d _mm256_maskz_broadcastsd_pd( __mmask8 k, __m128d a);
VBROADCASTSD __m256d _mm256_broadcast_sd(double *a);
VBROADCASTSS __m512 _mm512_broadcastss_ps( __m128 a);
VBROADCASTSS __m512 _mm512_mask_broadcastss_ps( __m512 s, __mmask16 k, __m128 a);
VBROADCASTSS __m512 _mm512_maskz_broadcastss_ps( __mmask16 k, __m128 a);
VBROADCASTSS __m256 _mm256_broadcastss_ps( __m128 a);
VBROADCASTSS __m256 _mm256_mask_broadcastss_ps( __m256 s, __mmask8 k, __m128 a);
VBROADCASTSS __m256 _mm256_maskz_broadcastss_ps( __mmask8 k, __m128 a);
VBROADCASTSS __m128 _mm_broadcastss_ps( __m128 a);
VBROADCASTSS __m128 _mm_mask_broadcastss_ps( __m128 s, __mmask8 k, __m128 a);
VBROADCASTSS __m128 _mm_maskz_broadcastss_ps( __mmask8 k, __m128 a);
VBROADCASTSS __m128 _mm_broadcast_ss(float *a);
VBROADCASTSS __m256 _mm256_broadcast_ss(float *a);
VBROADCASTF128 __m256 _mm256_broadcast_ps( __m128 * a);
VBROADCASTF128 __m256d _mm256_broadcast_pd( __m128d * a);


Exceptions

VEX-encoded instructions, see Exceptions Type 6; EVEX-encoded instructions, see Exceptions Type E6.

#UD If VEX.L = 0 for VBROADCASTSD or VBROADCASTF128.

If EVEX.L’L = 0 for VBROADCASTSD/VBROADCASTF32X2/VBROADCASTF32X4/VBROADCASTF64X2.
If EVEX.L’L < 10b for VBROADCASTF32X8/VBROADCASTF64X4.


VCOMPRESSPD—Store Sparse Packed Double-Precision Floating-Point Values into Dense Memory

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F38.W1 8A /r VCOMPRESSPD xmm1/m128 {k1}{z}, xmm2 | A | V/V | AVX512VL AVX512F | Compress packed double-precision floating-point values from xmm2 to xmm1/m128 using writemask k1.
EVEX.256.66.0F38.W1 8A /r VCOMPRESSPD ymm1/m256 {k1}{z}, ymm2 | A | V/V | AVX512VL AVX512F | Compress packed double-precision floating-point values from ymm2 to ymm1/m256 using writemask k1.
EVEX.512.66.0F38.W1 8A /r VCOMPRESSPD zmm1/m512 {k1}{z}, zmm2 | A | V/V | AVX512F | Compress packed double-precision floating-point values from zmm2 using control mask k1 to zmm1/m512.

Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | Tuple1 Scalar | ModRM:r/m (w) | ModRM:reg (r) | NA | NA

Description

Compress (store) up to 8 double-precision floating-point values from the source operand (the second operand) as a contiguous vector to the destination operand (the first operand). The source operand is a ZMM/YMM/XMM register; the destination operand can be a ZMM/YMM/XMM register or a 512/256/128-bit memory location.

The opmask register k1 selects the active elements (partial vector or possibly non-contiguous if less than 8 active elements) from the source operand to compress into a contiguous vector. The contiguous vector is written to the destination starting from the low element of the destination operand.

Memory destination version: Only the contiguous vector is written to the destination memory location. EVEX.z must be zero.

Register destination version: If the vector length of the contiguous vector is less than that of the input vector in the source operand, the upper bits of the destination register are unmodified if EVEX.z is not set, otherwise the upper bits are zeroed.

EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.

Note that the compressed displacement assumes a pre-scaling (N) corresponding to the size of one single element instead of the size of the full vector.
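The masked selection and contiguous packing described above can be modeled in plain C. This is a sketch of the element-selection semantics only (compress_f64 is a hypothetical helper, not an Intel API, and the flag and masking side effects of the instruction are omitted):

```c
#include <stdint.h>

/* Model of the VCOMPRESSPD element selection: elements of src whose
   writemask bit in k1 is set are packed contiguously into the low
   slots of dest, lowest active element first. */
int compress_f64(const double *src, uint8_t k1, int kl, double *dest)
{
    int k = 0;
    for (int j = 0; j < kl; j++) {
        if (k1 & (1u << j))        /* element j is active */
            dest[k++] = src[j];    /* next contiguous destination slot */
    }
    return k;                      /* number of elements written */
}
```

With KL = 4 and mask 1010b, elements 1 and 3 land in dest[0] and dest[1]; in the register form the remaining destination slots would then be merged or zeroed according to EVEX.z.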


Operation

VCOMPRESSPD (EVEX encoded versions) store form

(KL, VL) = (2, 128), (4, 256), (8, 512)

SIZE ← 64
k ← 0
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask* THEN
        DEST[k+SIZE-1:k] ← SRC[i+63:i]
        k ← k + SIZE
    FI;
ENDFOR



VCOMPRESSPD (EVEX encoded versions) reg-reg form

(KL, VL) = (2, 128), (4, 256), (8, 512)

SIZE ← 64
k ← 0
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask* THEN
        DEST[k+SIZE-1:k] ← SRC[i+63:i]
        k ← k + SIZE
    FI;
ENDFOR
IF *merging-masking*
    THEN *DEST[VL-1:k] remains unchanged*
    ELSE DEST[VL-1:k] ← 0
FI
DEST[MAXVL-1:VL] ← 0


Intel C/C++ Compiler Intrinsic Equivalent

VCOMPRESSPD __m512d _mm512_mask_compress_pd( __m512d s, __mmask8 k, __m512d a);
VCOMPRESSPD __m512d _mm512_maskz_compress_pd( __mmask8 k, __m512d a);
VCOMPRESSPD void _mm512_mask_compressstoreu_pd( void * d, __mmask8 k, __m512d a);
VCOMPRESSPD __m256d _mm256_mask_compress_pd( __m256d s, __mmask8 k, __m256d a);
VCOMPRESSPD __m256d _mm256_maskz_compress_pd( __mmask8 k, __m256d a);
VCOMPRESSPD void _mm256_mask_compressstoreu_pd( void * d, __mmask8 k, __m256d a);
VCOMPRESSPD __m128d _mm_mask_compress_pd( __m128d s, __mmask8 k, __m128d a);
VCOMPRESSPD __m128d _mm_maskz_compress_pd( __mmask8 k, __m128d a);
VCOMPRESSPD void _mm_mask_compressstoreu_pd( void * d, __mmask8 k, __m128d a);


SIMD Floating-Point Exceptions

None


Other Exceptions

EVEX-encoded instructions, see Exceptions Type E4.nb.

#UD If EVEX.vvvv != 1111B.


VCOMPRESSPS—Store Sparse Packed Single-Precision Floating-Point Values into Dense Memory

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F38.W0 8A /r VCOMPRESSPS xmm1/m128 {k1}{z}, xmm2 | A | V/V | AVX512VL AVX512F | Compress packed single-precision floating-point values from xmm2 to xmm1/m128 using writemask k1.
EVEX.256.66.0F38.W0 8A /r VCOMPRESSPS ymm1/m256 {k1}{z}, ymm2 | A | V/V | AVX512VL AVX512F | Compress packed single-precision floating-point values from ymm2 to ymm1/m256 using writemask k1.
EVEX.512.66.0F38.W0 8A /r VCOMPRESSPS zmm1/m512 {k1}{z}, zmm2 | A | V/V | AVX512F | Compress packed single-precision floating-point values from zmm2 using control mask k1 to zmm1/m512.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | Tuple1 Scalar | ModRM:r/m (w) | ModRM:reg (r) | NA | NA

Description

Compress (store) up to 16 single-precision floating-point values from the source operand (the second operand) to the destination operand (the first operand). The source operand is a ZMM/YMM/XMM register; the destination operand can be a ZMM/YMM/XMM register or a 512/256/128-bit memory location.

The opmask register k1 selects the active elements (a partial vector or possibly non-contiguous if less than 16 active elements) from the source operand to compress into a contiguous vector. The contiguous vector is written to the destination starting from the low element of the destination operand.

Memory destination version: Only the contiguous vector is written to the destination memory location. EVEX.z must be zero.

Register destination version: If the vector length of the contiguous vector is less than that of the input vector in the source operand, the upper bits of the destination register are unmodified if EVEX.z is not set, otherwise the upper bits are zeroed.

EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.

Note that the compressed displacement assumes a pre-scaling (N) corresponding to the size of one single element instead of the size of the full vector.


Operation

VCOMPRESSPS (EVEX encoded versions) store form

(KL, VL) = (4, 128), (8, 256), (16, 512)

SIZE ← 32
k ← 0
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask* THEN
        DEST[k+SIZE-1:k] ← SRC[i+31:i]
        k ← k + SIZE
    FI;
ENDFOR;



VCOMPRESSPS (EVEX encoded versions) reg-reg form

(KL, VL) = (4, 128), (8, 256), (16, 512)

SIZE ← 32
k ← 0
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask* THEN
        DEST[k+SIZE-1:k] ← SRC[i+31:i]
        k ← k + SIZE
    FI;
ENDFOR
IF *merging-masking*
    THEN *DEST[VL-1:k] remains unchanged*
    ELSE DEST[VL-1:k] ← 0
FI
DEST[MAXVL-1:VL] ← 0


Intel C/C++ Compiler Intrinsic Equivalent

VCOMPRESSPS __m512 _mm512_mask_compress_ps( __m512 s, __mmask16 k, __m512 a);
VCOMPRESSPS __m512 _mm512_maskz_compress_ps( __mmask16 k, __m512 a);
VCOMPRESSPS void _mm512_mask_compressstoreu_ps( void * d, __mmask16 k, __m512 a);
VCOMPRESSPS __m256 _mm256_mask_compress_ps( __m256 s, __mmask8 k, __m256 a);
VCOMPRESSPS __m256 _mm256_maskz_compress_ps( __mmask8 k, __m256 a);
VCOMPRESSPS void _mm256_mask_compressstoreu_ps( void * d, __mmask8 k, __m256 a);
VCOMPRESSPS __m128 _mm_mask_compress_ps( __m128 s, __mmask8 k, __m128 a);
VCOMPRESSPS __m128 _mm_maskz_compress_ps( __mmask8 k, __m128 a);
VCOMPRESSPS void _mm_mask_compressstoreu_ps( void * d, __mmask8 k, __m128 a);


SIMD Floating-Point Exceptions

None


Other Exceptions

EVEX-encoded instructions, see Exceptions Type E4.nb.

#UD If EVEX.vvvv != 1111B.


VCVTPD2QQ—Convert Packed Double-Precision Floating-Point Values to Packed Quadword Integers

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F.W1 7B /r VCVTPD2QQ xmm1 {k1}{z}, xmm2/m128/m64bcst | A | V/V | AVX512VL AVX512DQ | Convert two packed double-precision floating-point values from xmm2/m128/m64bcst to two packed quadword integers in xmm1 with writemask k1.
EVEX.256.66.0F.W1 7B /r VCVTPD2QQ ymm1 {k1}{z}, ymm2/m256/m64bcst | A | V/V | AVX512VL AVX512DQ | Convert four packed double-precision floating-point values from ymm2/m256/m64bcst to four packed quadword integers in ymm1 with writemask k1.
EVEX.512.66.0F.W1 7B /r VCVTPD2QQ zmm1 {k1}{z}, zmm2/m512/m64bcst{er} | A | V/V | AVX512DQ | Convert eight packed double-precision floating-point values from zmm2/m512/m64bcst to eight packed quadword integers in zmm1 with writemask k1.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | Full | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Description

Converts packed double-precision floating-point values in the source operand (second operand) to packed quadword integers in the destination operand (first operand).

EVEX encoded versions: The source operand is a ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.

When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register or the embedded rounding control bits. If a converted result cannot be represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value (2^(w-1), where w represents the number of bits in the destination format) is returned.

EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.



Operation

VCVTPD2QQ (EVEX encoded version) when src operand is a register

(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL == 512) AND (EVEX.b == 1)
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;

FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← Convert_Double_Precision_Floating_Point_To_QuadInteger(SRC[i+63:i])
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
            ELSE ; zeroing-masking
                DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


VCVTPD2QQ (EVEX encoded version) when src operand is a memory source

(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b == 1)
            THEN DEST[i+63:i] ← Convert_Double_Precision_Floating_Point_To_QuadInteger(SRC[63:0])
            ELSE DEST[i+63:i] ← Convert_Double_Precision_Floating_Point_To_QuadInteger(SRC[i+63:i])
        FI;
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+63:i] remains unchanged*
        ELSE ; zeroing-masking
            DEST[i+63:i] ← 0
        FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0



Intel C/C++ Compiler Intrinsic Equivalent

VCVTPD2QQ __m512i _mm512_cvtpd_epi64( __m512d a);
VCVTPD2QQ __m512i _mm512_mask_cvtpd_epi64( __m512i s, __mmask8 k, __m512d a);
VCVTPD2QQ __m512i _mm512_maskz_cvtpd_epi64( __mmask8 k, __m512d a);
VCVTPD2QQ __m512i _mm512_cvt_roundpd_epi64( __m512d a, int r);
VCVTPD2QQ __m512i _mm512_mask_cvt_roundpd_epi64( __m512i s, __mmask8 k, __m512d a, int r);
VCVTPD2QQ __m512i _mm512_maskz_cvt_roundpd_epi64( __mmask8 k, __m512d a, int r);
VCVTPD2QQ __m256i _mm256_mask_cvtpd_epi64( __m256i s, __mmask8 k, __m256d a);
VCVTPD2QQ __m256i _mm256_maskz_cvtpd_epi64( __mmask8 k, __m256d a);
VCVTPD2QQ __m128i _mm_mask_cvtpd_epi64( __m128i s, __mmask8 k, __m128d a);
VCVTPD2QQ __m128i _mm_maskz_cvtpd_epi64( __mmask8 k, __m128d a);
VCVTPD2QQ __m256i _mm256_cvtpd_epi64 ( __m256d src)
VCVTPD2QQ __m128i _mm_cvtpd_epi64 ( __m128d src)


SIMD Floating-Point Exceptions

Invalid, Precision


Other Exceptions

EVEX-encoded instructions, see Exceptions Type E2.

#UD If EVEX.vvvv != 1111B.


VCVTPD2UDQ—Convert Packed Double-Precision Floating-Point Values to Packed Unsigned Doubleword Integers

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.0F.W1 79 /r VCVTPD2UDQ xmm1 {k1}{z}, xmm2/m128/m64bcst | A | V/V | AVX512VL AVX512F | Convert two packed double-precision floating-point values in xmm2/m128/m64bcst to two unsigned doubleword integers in xmm1 subject to writemask k1.
EVEX.256.0F.W1 79 /r VCVTPD2UDQ xmm1 {k1}{z}, ymm2/m256/m64bcst | A | V/V | AVX512VL AVX512F | Convert four packed double-precision floating-point values in ymm2/m256/m64bcst to four unsigned doubleword integers in xmm1 subject to writemask k1.
EVEX.512.0F.W1 79 /r VCVTPD2UDQ ymm1 {k1}{z}, zmm2/m512/m64bcst{er} | A | V/V | AVX512F | Convert eight packed double-precision floating-point values in zmm2/m512/m64bcst to eight unsigned doubleword integers in ymm1 subject to writemask k1.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | Full | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Description

Converts packed double-precision floating-point values in the source operand (the second operand) to packed unsigned doubleword integers in the destination operand (the first operand).

When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register or the embedded rounding control bits. If a converted result cannot be represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked, the integer value 2^w - 1 is returned, where w represents the number of bits in the destination format.

The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a 64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1. The upper bits (MAXVL-1:256) of the corresponding destination are zeroed.

EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.



Operation

VCVTPD2UDQ (EVEX encoded versions) when src2 operand is a register

(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;

FOR j ← 0 TO KL-1
    i ← j * 32
    k ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← Convert_Double_Precision_Floating_Point_To_UInteger(SRC[k+63:k])
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
            ELSE ; zeroing-masking
                DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL/2] ← 0


VCVTPD2UDQ (EVEX encoded versions) when src operand is a memory source

(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j ← 0 TO KL-1
    i ← j * 32
    k ← j * 64
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b = 1)
            THEN DEST[i+31:i] ← Convert_Double_Precision_Floating_Point_To_UInteger(SRC[63:0])
            ELSE DEST[i+31:i] ← Convert_Double_Precision_Floating_Point_To_UInteger(SRC[k+63:k])
        FI;
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+31:i] remains unchanged*
        ELSE ; zeroing-masking
            DEST[i+31:i] ← 0
        FI
    FI;
ENDFOR
DEST[MAXVL-1:VL/2] ← 0



Intel C/C++ Compiler Intrinsic Equivalent

VCVTPD2UDQ __m256i _mm512_cvtpd_epu32( __m512d a);
VCVTPD2UDQ __m256i _mm512_mask_cvtpd_epu32( __m256i s, __mmask8 k, __m512d a);
VCVTPD2UDQ __m256i _mm512_maskz_cvtpd_epu32( __mmask8 k, __m512d a);
VCVTPD2UDQ __m256i _mm512_cvt_roundpd_epu32( __m512d a, int r);
VCVTPD2UDQ __m256i _mm512_mask_cvt_roundpd_epu32( __m256i s, __mmask8 k, __m512d a, int r);
VCVTPD2UDQ __m256i _mm512_maskz_cvt_roundpd_epu32( __mmask8 k, __m512d a, int r);
VCVTPD2UDQ __m128i _mm256_mask_cvtpd_epu32( __m128i s, __mmask8 k, __m256d a);
VCVTPD2UDQ __m128i _mm256_maskz_cvtpd_epu32( __mmask8 k, __m256d a);
VCVTPD2UDQ __m128i _mm_mask_cvtpd_epu32( __m128i s, __mmask8 k, __m128d a);
VCVTPD2UDQ __m128i _mm_maskz_cvtpd_epu32( __mmask8 k, __m128d a);


SIMD Floating-Point Exceptions

Invalid, Precision


Other Exceptions

EVEX-encoded instructions, see Exceptions Type E2.

#UD If EVEX.vvvv != 1111B.


VCVTPD2UQQ—Convert Packed Double-Precision Floating-Point Values to Packed Unsigned Quadword Integers

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F.W1 79 /r VCVTPD2UQQ xmm1 {k1}{z}, xmm2/m128/m64bcst | A | V/V | AVX512VL AVX512DQ | Convert two packed double-precision floating-point values from xmm2/mem to two packed unsigned quadword integers in xmm1 with writemask k1.
EVEX.256.66.0F.W1 79 /r VCVTPD2UQQ ymm1 {k1}{z}, ymm2/m256/m64bcst | A | V/V | AVX512VL AVX512DQ | Convert four packed double-precision floating-point values from ymm2/mem to four packed unsigned quadword integers in ymm1 with writemask k1.
EVEX.512.66.0F.W1 79 /r VCVTPD2UQQ zmm1 {k1}{z}, zmm2/m512/m64bcst{er} | A | V/V | AVX512DQ | Convert eight packed double-precision floating-point values from zmm2/mem to eight packed unsigned quadword integers in zmm1 with writemask k1.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | Full | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Description

Converts packed double-precision floating-point values in the source operand (second operand) to packed unsigned quadword integers in the destination operand (first operand).

When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register or the embedded rounding control bits. If a converted result cannot be represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked, the integer value 2^w - 1 is returned, where w represents the number of bits in the destination format.

The source operand is a ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.

EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.



Operation

VCVTPD2UQQ (EVEX encoded versions) when src operand is a register

(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL == 512) AND (EVEX.b == 1)
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;

FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← Convert_Double_Precision_Floating_Point_To_UQuadInteger(SRC[i+63:i])
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
            ELSE ; zeroing-masking
                DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


VCVTPD2UQQ (EVEX encoded versions) when src operand is a memory source

(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b == 1)
            THEN DEST[i+63:i] ← Convert_Double_Precision_Floating_Point_To_UQuadInteger(SRC[63:0])
            ELSE DEST[i+63:i] ← Convert_Double_Precision_Floating_Point_To_UQuadInteger(SRC[i+63:i])
        FI;
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+63:i] remains unchanged*
        ELSE ; zeroing-masking
            DEST[i+63:i] ← 0
        FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0



Intel C/C++ Compiler Intrinsic Equivalent

VCVTPD2UQQ __m512i _mm512_cvtpd_epu64( __m512d a);
VCVTPD2UQQ __m512i _mm512_mask_cvtpd_epu64( __m512i s, __mmask8 k, __m512d a);
VCVTPD2UQQ __m512i _mm512_maskz_cvtpd_epu64( __mmask8 k, __m512d a);
VCVTPD2UQQ __m512i _mm512_cvt_roundpd_epu64( __m512d a, int r);
VCVTPD2UQQ __m512i _mm512_mask_cvt_roundpd_epu64( __m512i s, __mmask8 k, __m512d a, int r);
VCVTPD2UQQ __m512i _mm512_maskz_cvt_roundpd_epu64( __mmask8 k, __m512d a, int r);
VCVTPD2UQQ __m256i _mm256_mask_cvtpd_epu64( __m256i s, __mmask8 k, __m256d a);
VCVTPD2UQQ __m256i _mm256_maskz_cvtpd_epu64( __mmask8 k, __m256d a);
VCVTPD2UQQ __m128i _mm_mask_cvtpd_epu64( __m128i s, __mmask8 k, __m128d a);
VCVTPD2UQQ __m128i _mm_maskz_cvtpd_epu64( __mmask8 k, __m128d a);
VCVTPD2UQQ __m256i _mm256_cvtpd_epu64 ( __m256d src)
VCVTPD2UQQ __m128i _mm_cvtpd_epu64 ( __m128d src)


SIMD Floating-Point Exceptions

Invalid, Precision


Other Exceptions

EVEX-encoded instructions, see Exceptions Type E2.

#UD If EVEX.vvvv != 1111B.


VCVTPH2PS—Convert 16-bit FP values to Single-Precision FP values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
VEX.128.66.0F38.W0 13 /r VCVTPH2PS xmm1, xmm2/m64 | A | V/V | F16C | Convert four packed half-precision (16-bit) floating-point values in xmm2/m64 to packed single-precision floating-point values in xmm1.
VEX.256.66.0F38.W0 13 /r VCVTPH2PS ymm1, xmm2/m128 | A | V/V | F16C | Convert eight packed half-precision (16-bit) floating-point values in xmm2/m128 to packed single-precision floating-point values in ymm1.
EVEX.128.66.0F38.W0 13 /r VCVTPH2PS xmm1 {k1}{z}, xmm2/m64 | B | V/V | AVX512VL AVX512F | Convert four packed half-precision (16-bit) floating-point values in xmm2/m64 to packed single-precision floating-point values in xmm1.
EVEX.256.66.0F38.W0 13 /r VCVTPH2PS ymm1 {k1}{z}, xmm2/m128 | B | V/V | AVX512VL AVX512F | Convert eight packed half-precision (16-bit) floating-point values in xmm2/m128 to packed single-precision floating-point values in ymm1.
EVEX.512.66.0F38.W0 13 /r VCVTPH2PS zmm1 {k1}{z}, ymm2/m256 {sae} | B | V/V | AVX512F | Convert sixteen packed half-precision (16-bit) floating-point values in ymm2/m256 to packed single-precision floating-point values in zmm1.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | NA | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
B | Half Mem | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Description

Converts packed half-precision (16-bit) floating-point values in the low-order bits of the source operand (the second operand) to packed single-precision floating-point values and writes the converted values into the destination operand (the first operand).

In case of a denormal operand, the correct normal result is returned. MXCSR.DAZ is ignored and is treated as if it were 0. No denormal exception is reported on MXCSR.

VEX.128 version: The source operand is a XMM register or 64-bit memory location. The destination operand is a XMM register. The upper bits (MAXVL-1:128) of the corresponding destination register are zeroed.

VEX.256 version: The source operand is a XMM register or 128-bit memory location. The destination operand is a YMM register. Bits (MAXVL-1:256) of the corresponding destination register are zeroed.

EVEX encoded versions: The source operand is a YMM/XMM/XMM (low 64-bits) register or a 256/128/64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.

The diagram below illustrates how data is converted from four packed half-precision (in 64 bits) to four single-precision (in 128 bits) FP values.

Note: VEX.vvvv and EVEX.vvvv are reserved (must be 1111b).
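The bit-level expansion performed by this conversion can be sketched in plain C (an illustrative model, not an Intel API; half_to_float is a hypothetical helper). Because binary32 has strictly more exponent range and precision than binary16, every half-precision input, including denormals, converts exactly:

```c
#include <stdint.h>
#include <string.h>

/* Sketch of the binary16 -> binary32 expansion: widen the sign,
   rebias the exponent (15 -> 127), and shift the mantissa into place.
   Denormal inputs are normalized, matching the manual's note that
   MXCSR.DAZ is ignored. */
float half_to_float(uint16_t h)
{
    uint32_t sign = (uint32_t)(h >> 15) << 31;
    uint32_t exp  = (h >> 10) & 0x1F;
    uint32_t man  = h & 0x3FF;
    uint32_t bits;

    if (exp == 0x1F) {                        /* Inf or NaN */
        bits = sign | 0x7F800000u | (man << 13);
    } else if (exp == 0) {
        if (man == 0) {
            bits = sign;                      /* signed zero */
        } else {                              /* denormal: normalize it */
            exp = 127 - 15 + 1;
            while (!(man & 0x400)) { man <<= 1; exp--; }
            man &= 0x3FF;                     /* drop the implicit bit */
            bits = sign | (exp << 23) | (man << 13);
        }
    } else {                                  /* normal number: rebias */
        bits = sign | ((exp - 15 + 127) << 23) | (man << 13);
    }
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

For example, the binary16 encoding 3C00H converts to 1.0f, and C000H converts to -2.0f.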



[Figure 5-6. VCVTPH2PS (128-bit Version): the four packed half-precision values VH3..VH0 in xmm2/mem64 are each converted to the four single-precision values VS3..VS0 in xmm1.]



Operation

vCvt_h2s(SRC1[15:0])
{
    RETURN Cvt_Half_Precision_To_Single_Precision(SRC1[15:0]);
}

VCVTPH2PS (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    k ← j * 16
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← vCvt_h2s(SRC[k+15:k])
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
            ELSE ; zeroing-masking
                DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0

VCVTPH2PS (VEX.256 encoded version)
DEST[31:0] ← vCvt_h2s(SRC1[15:0]);
DEST[63:32] ← vCvt_h2s(SRC1[31:16]);
DEST[95:64] ← vCvt_h2s(SRC1[47:32]);
DEST[127:96] ← vCvt_h2s(SRC1[63:48]);
DEST[159:128] ← vCvt_h2s(SRC1[79:64]);
DEST[191:160] ← vCvt_h2s(SRC1[95:80]);
DEST[223:192] ← vCvt_h2s(SRC1[111:96]);
DEST[255:224] ← vCvt_h2s(SRC1[127:112]);
DEST[MAXVL-1:256] ← 0

VCVTPH2PS (VEX.128 encoded version)
DEST[31:0] ← vCvt_h2s(SRC1[15:0]);
DEST[63:32] ← vCvt_h2s(SRC1[31:16]);
DEST[95:64] ← vCvt_h2s(SRC1[47:32]);
DEST[127:96] ← vCvt_h2s(SRC1[63:48]);
DEST[MAXVL-1:128] ← 0


Flags Affected

None

Intel C/C++ Compiler Intrinsic Equivalent

VCVTPH2PS __m512 _mm512_cvtph_ps( __m256i a);
VCVTPH2PS __m512 _mm512_mask_cvtph_ps( __m512 s, __mmask16 k, __m256i a);
VCVTPH2PS __m512 _mm512_maskz_cvtph_ps( __mmask16 k, __m256i a);
VCVTPH2PS __m512 _mm512_cvt_roundph_ps( __m256i a, int sae);
VCVTPH2PS __m512 _mm512_mask_cvt_roundph_ps( __m512 s, __mmask16 k, __m256i a, int sae);
VCVTPH2PS __m512 _mm512_maskz_cvt_roundph_ps( __mmask16 k, __m256i a, int sae);
VCVTPH2PS __m256 _mm256_mask_cvtph_ps( __m256 s, __mmask8 k, __m128i a);
VCVTPH2PS __m256 _mm256_maskz_cvtph_ps( __mmask8 k, __m128i a);
VCVTPH2PS __m128 _mm_mask_cvtph_ps( __m128 s, __mmask8 k, __m128i a);
VCVTPH2PS __m128 _mm_maskz_cvtph_ps( __mmask8 k, __m128i a);
VCVTPH2PS __m128 _mm_cvtph_ps ( __m128i m1);
VCVTPH2PS __m256 _mm256_cvtph_ps ( __m128i m1)

SIMD Floating-Point Exceptions

Invalid

Other Exceptions

VEX-encoded instructions, see Exceptions Type 11 (do not report #AC); EVEX-encoded instructions, see Exceptions Type E11.

#UD If VEX.W=1.
#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.


VCVTPS2PH—Convert Single-Precision FP value to 16-bit FP value

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
VEX.128.66.0F3A.W0 1D /r ib VCVTPS2PH xmm1/m64, xmm2, imm8 | A | V/V | F16C | Convert four packed single-precision floating-point values in xmm2 to packed half-precision (16-bit) floating-point values in xmm1/m64. Imm8 provides rounding controls.
VEX.256.66.0F3A.W0 1D /r ib VCVTPS2PH xmm1/m128, ymm2, imm8 | A | V/V | F16C | Convert eight packed single-precision floating-point values in ymm2 to packed half-precision (16-bit) floating-point values in xmm1/m128. Imm8 provides rounding controls.
EVEX.128.66.0F3A.W0 1D /r ib VCVTPS2PH xmm1/m64 {k1}{z}, xmm2, imm8 | B | V/V | AVX512VL AVX512F | Convert four packed single-precision floating-point values in xmm2 to packed half-precision (16-bit) floating-point values in xmm1/m64. Imm8 provides rounding controls.
EVEX.256.66.0F3A.W0 1D /r ib VCVTPS2PH xmm1/m128 {k1}{z}, ymm2, imm8 | B | V/V | AVX512VL AVX512F | Convert eight packed single-precision floating-point values in ymm2 to packed half-precision (16-bit) floating-point values in xmm1/m128. Imm8 provides rounding controls.
EVEX.512.66.0F3A.W0 1D /r ib VCVTPS2PH ymm1/m256 {k1}{z}, zmm2{sae}, imm8 | B | V/V | AVX512F | Convert sixteen packed single-precision floating-point values in zmm2 to packed half-precision (16-bit) floating-point values in ymm1/m256. Imm8 provides rounding controls.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | NA | ModRM:r/m (w) | ModRM:reg (r) | Imm8 | NA
B | Half Mem | ModRM:r/m (w) | ModRM:reg (r) | Imm8 | NA

Description

Convert packed single-precision floating-point values in the source operand to half-precision (16-bit) floating-point values and store to the destination operand. The rounding mode is specified using the immediate field (imm8).

Underflow results (i.e., tiny results) are converted to denormals. MXCSR.FTZ is ignored. If a source element is denormal relative to the input format with DM masked and at least one of PM or UM unmasked, a SIMD exception will be raised with DE, UE and PE set.


[Figure 5-7. VCVTPS2PH (128-bit Version): the four packed single-precision values VS3..VS0 in xmm2 are each converted to the four half-precision values VH3..VH0 in xmm1/mem64.]



The immediate byte defines several bit fields that control the rounding operation. The effect and encoding of the RC field are listed in Table 5-3.


Table 5-3. Immediate Byte Encoding for 16-bit Floating-Point Conversion Instructions

Bits | Field Name/value | Description | Comment
Imm[1:0] | RC=00B | Round to nearest even | If Imm[2] = 0
Imm[1:0] | RC=01B | Round down | If Imm[2] = 0
Imm[1:0] | RC=10B | Round up | If Imm[2] = 0
Imm[1:0] | RC=11B | Truncate | If Imm[2] = 0
Imm[2] | MS1=0 | Use imm[1:0] for rounding | Ignore MXCSR.RC
Imm[2] | MS1=1 | Use MXCSR.RC for rounding |
Imm[7:3] | Ignored | Ignored by processor |


VEX.128 version: The source operand is a XMM register. The destination operand is a XMM register or 64-bit memory location. If the destination operand is a register, then the upper bits (MAXVL-1:64) of the corresponding register are zeroed.

VEX.256 version: The source operand is a YMM register. The destination operand is a XMM register or 128-bit memory location. If the destination operand is a register, the upper bits (MAXVL-1:128) of the corresponding destination register are zeroed.

Note: VEX.vvvv and EVEX.vvvv are reserved (must be 1111b).

EVEX encoded versions: The source operand is a ZMM/YMM/XMM register. The destination operand is a YMM/XMM/XMM (low 64-bits) register or a 256/128/64-bit memory location, conditionally updated with writemask k1. Bits (MAXVL-1:256/128/64) of the corresponding destination register are zeroed.
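The imm8 layout from Table 5-3 can be sketched with a small helper (make_imm8 is a hypothetical illustration, not an Intel API): bit 2 selects MXCSR.RC, bits 1:0 carry the immediate rounding control, and bits 7:3 are ignored by the processor:

```c
#include <stdint.h>

/* RC encodings from Table 5-3, used when Imm[2] = 0. */
enum rc { RC_NEAREST = 0, RC_DOWN = 1, RC_UP = 2, RC_TRUNCATE = 3 };

/* Build the VCVTPS2PH imm8 byte: Imm[2] chooses MXCSR.RC over the
   immediate RC field; Imm[1:0] holds the RC value; Imm[7:3] left 0. */
uint8_t make_imm8(int use_mxcsr, enum rc rounding)
{
    return (uint8_t)(((use_mxcsr & 1) << 2) | ((unsigned)rounding & 3));
}
```

For example, make_imm8(0, RC_TRUNCATE) yields 03H (truncate via the immediate), and make_imm8(1, RC_NEAREST) yields 04H (defer to MXCSR.RC).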


Operation

vCvt_s2h(SRC1[31:0])
{
    IF Imm[2] = 0
        THEN ; using Imm[1:0] for rounding control, see Table 5-3
            RETURN Cvt_Single_Precision_To_Half_Precision_FP_Imm(SRC1[31:0]);
        ELSE ; using MXCSR.RC for rounding control
            RETURN Cvt_Single_Precision_To_Half_Precision_FP_Mxcsr(SRC1[31:0]);
    FI;
}

VCVTPS2PH (EVEX encoded versions) when dest is a register
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 16
    k ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+15:i] ← vCvt_s2h(SRC[k+31:k])
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+15:i] remains unchanged*
            ELSE ; zeroing-masking
                DEST[i+15:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL/2] ← 0

VCVTPS2PH (EVEX encoded versions) when dest is memory
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 16
    k ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+15:i] ← vCvt_s2h(SRC[k+31:k])
        ELSE *DEST[i+15:i] remains unchanged* ; merging-masking
    FI;
ENDFOR

VCVTPS2PH (VEX.256 encoded version)
DEST[15:0] ← vCvt_s2h(SRC1[31:0]);
DEST[31:16] ← vCvt_s2h(SRC1[63:32]);
DEST[47:32] ← vCvt_s2h(SRC1[95:64]);
DEST[63:48] ← vCvt_s2h(SRC1[127:96]);
DEST[79:64] ← vCvt_s2h(SRC1[159:128]);
DEST[95:80] ← vCvt_s2h(SRC1[191:160]);
DEST[111:96] ← vCvt_s2h(SRC1[223:192]);
DEST[127:112] ← vCvt_s2h(SRC1[255:224]);
DEST[MAXVL-1:128] ← 0

VCVTPS2PH (VEX.128 encoded version)
DEST[15:0] ← vCvt_s2h(SRC1[31:0]);
DEST[31:16] ← vCvt_s2h(SRC1[63:32]);
DEST[47:32] ← vCvt_s2h(SRC1[95:64]);
DEST[63:48] ← vCvt_s2h(SRC1[127:96]);
DEST[MAXVL-1:64] ← 0


    Flags Affected

    None


    Intel C/C++ Compiler Intrinsic Equivalent

VCVTPS2PH __m256i _mm512_cvtps_ph(__m512 a);
VCVTPS2PH __m256i _mm512_mask_cvtps_ph(__m256i s, __mmask16 k, __m512 a);
VCVTPS2PH __m256i _mm512_maskz_cvtps_ph(__mmask16 k, __m512 a);
VCVTPS2PH __m256i _mm512_cvt_roundps_ph(__m512 a, const int imm);
VCVTPS2PH __m256i _mm512_mask_cvt_roundps_ph(__m256i s, __mmask16 k, __m512 a, const int imm);
VCVTPS2PH __m256i _mm512_maskz_cvt_roundps_ph(__mmask16 k, __m512 a, const int imm);
VCVTPS2PH __m128i _mm256_mask_cvtps_ph(__m128i s, __mmask8 k, __m256 a);
VCVTPS2PH __m128i _mm256_maskz_cvtps_ph(__mmask8 k, __m256 a);
VCVTPS2PH __m128i _mm_mask_cvtps_ph(__m128i s, __mmask8 k, __m128 a);
VCVTPS2PH __m128i _mm_maskz_cvtps_ph(__mmask8 k, __m128 a);
VCVTPS2PH __m128i _mm_cvtps_ph(__m128 m1, const int imm);
VCVTPS2PH __m128i _mm256_cvtps_ph(__m256 m1, const int imm);
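The per-element conversion performed by vCvt_s2h can be modeled in portable C. The sketch below is illustrative, not the hardware implementation: the helper name f32_to_f16_rne is hypothetical, and only the round-to-nearest-even case (Imm[1:0] = 00b, no exception reporting) is modeled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* f32_to_f16_rne (hypothetical name): convert one IEEE-754 binary32 value
 * to a binary16 bit pattern, rounding to nearest with ties to even. */
static uint16_t f32_to_f16_rne(float f)
{
    uint32_t x;
    memcpy(&x, &f, sizeof x);                 /* raw bit pattern of the float */
    uint32_t sign = (x >> 16) & 0x8000u;
    int32_t  exp  = (int32_t)((x >> 23) & 0xFFu);
    uint32_t man  = x & 0x7FFFFFu;

    if (exp == 0xFF)                          /* Inf or NaN */
        return (uint16_t)(sign | 0x7C00u | (man ? (0x200u | (man >> 13)) : 0));

    int32_t e = exp - 127 + 15;               /* rebias to the half-precision exponent */
    if (e >= 0x1F)                            /* too large: overflow to infinity */
        return (uint16_t)(sign | 0x7C00u);
    if (e <= 0) {                             /* result is subnormal or zero */
        if (e < -10)
            return (uint16_t)sign;            /* rounds to signed zero */
        man |= 0x800000u;                     /* make the implicit bit explicit */
        uint32_t shift = (uint32_t)(14 - e);
        uint32_t half  = man >> shift;
        uint32_t rem   = man & ((1u << shift) - 1);
        if (rem > (1u << (shift - 1)) ||
            (rem == (1u << (shift - 1)) && (half & 1)))
            half++;                           /* round to nearest, ties to even */
        return (uint16_t)(sign | half);
    }
    uint32_t half = ((uint32_t)e << 10) | (man >> 13);
    uint32_t rem  = man & 0x1FFFu;            /* the 13 discarded mantissa bits */
    if (rem > 0x1000u || (rem == 0x1000u && (half & 1)))
        half++;                               /* a carry may bump the exponent; the encoding stays correct */
    return (uint16_t)(sign | half);
}
```

VCVTPS2PH simply applies this scalar conversion to each 32-bit source element, packing the 16-bit results into the low half of the destination.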


    SIMD Floating-Point Exceptions

Invalid, Underflow, Overflow, Precision, Denormal (if MXCSR.DAZ=0).



    Other Exceptions

    VEX-encoded instructions, see Exceptions Type 11 (do not report #AC); EVEX-encoded instructions, see Exceptions Type E11.

    #UD If VEX.W=1.

    #UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.


    VCVTPS2UDQ—Convert Packed Single-Precision Floating-Point Values to Packed Unsigned Doubleword Integer Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.0F.W0 79 /r VCVTPS2UDQ xmm1 {k1}{z}, xmm2/m128/m32bcst | A | V/V | AVX512VL AVX512F | Convert four packed single-precision floating-point values from xmm2/m128/m32bcst to four packed unsigned doubleword values in xmm1 subject to writemask k1.
EVEX.256.0F.W0 79 /r VCVTPS2UDQ ymm1 {k1}{z}, ymm2/m256/m32bcst | A | V/V | AVX512VL AVX512F | Convert eight packed single-precision floating-point values from ymm2/m256/m32bcst to eight packed unsigned doubleword values in ymm1 subject to writemask k1.
EVEX.512.0F.W0 79 /r VCVTPS2UDQ zmm1 {k1}{z}, zmm2/m512/m32bcst{er} | A | V/V | AVX512F | Convert sixteen packed single-precision floating-point values from zmm2/m512/m32bcst to sixteen packed unsigned doubleword values in zmm1 subject to writemask k1.


    Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | Full | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

    Description

Converts sixteen packed single-precision floating-point values in the source operand to sixteen unsigned doubleword integers in the destination operand.

When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register or the embedded rounding control bits. If a converted result cannot be represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked, the integer value 2^w – 1 is returned, where w represents the number of bits in the destination format.

The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a 32-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.

Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.



    Operation

VCVTPS2UDQ (EVEX encoded versions) when src operand is a register
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
    THEN
        SET_RM(EVEX.RC);
    ELSE
        SET_RM(MXCSR.RM);
FI;

FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← Convert_Single_Precision_Floating_Point_To_UInteger(SRC[i+31:i])
        ELSE
            IF *merging-masking*            ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE                        ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


VCVTPS2UDQ (EVEX encoded versions) when src operand is a memory source
(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1)
                THEN DEST[i+31:i] ← Convert_Single_Precision_Floating_Point_To_UInteger(SRC[31:0])
                ELSE DEST[i+31:i] ← Convert_Single_Precision_Floating_Point_To_UInteger(SRC[i+31:i])
            FI;
        ELSE
            IF *merging-masking*            ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE                        ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0



    Intel C/C++ Compiler Intrinsic Equivalent

VCVTPS2UDQ __m512i _mm512_cvtps_epu32(__m512 a);
VCVTPS2UDQ __m512i _mm512_mask_cvtps_epu32(__m512i s, __mmask16 k, __m512 a);
VCVTPS2UDQ __m512i _mm512_maskz_cvtps_epu32(__mmask16 k, __m512 a);
VCVTPS2UDQ __m512i _mm512_cvt_roundps_epu32(__m512 a, int r);
VCVTPS2UDQ __m512i _mm512_mask_cvt_roundps_epu32(__m512i s, __mmask16 k, __m512 a, int r);
VCVTPS2UDQ __m512i _mm512_maskz_cvt_roundps_epu32(__mmask16 k, __m512 a, int r);
VCVTPS2UDQ __m256i _mm256_cvtps_epu32(__m256 a);
VCVTPS2UDQ __m256i _mm256_mask_cvtps_epu32(__m256i s, __mmask8 k, __m256 a);
VCVTPS2UDQ __m256i _mm256_maskz_cvtps_epu32(__mmask8 k, __m256 a);
VCVTPS2UDQ __m128i _mm_cvtps_epu32(__m128 a);
VCVTPS2UDQ __m128i _mm_mask_cvtps_epu32(__m128i s, __mmask8 k, __m128 a);
VCVTPS2UDQ __m128i _mm_maskz_cvtps_epu32(__mmask8 k, __m128 a);


    SIMD Floating-Point Exceptions

    Invalid, Precision


    Other Exceptions

    EVEX-encoded instructions, see Exceptions Type E2.

    #UD If EVEX.vvvv != 1111B.


VCVTPS2QQ—Convert Packed Single Precision Floating-Point Values to Packed Signed Quadword Integer Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F.W0 7B /r VCVTPS2QQ xmm1 {k1}{z}, xmm2/m64/m32bcst | A | V/V | AVX512VL AVX512DQ | Convert two packed single precision floating-point values from xmm2/m64/m32bcst to two packed signed quadword values in xmm1 subject to writemask k1.
EVEX.256.66.0F.W0 7B /r VCVTPS2QQ ymm1 {k1}{z}, xmm2/m128/m32bcst | A | V/V | AVX512VL AVX512DQ | Convert four packed single precision floating-point values from xmm2/m128/m32bcst to four packed signed quadword values in ymm1 subject to writemask k1.
EVEX.512.66.0F.W0 7B /r VCVTPS2QQ zmm1 {k1}{z}, ymm2/m256/m32bcst{er} | A | V/V | AVX512DQ | Convert eight packed single precision floating-point values from ymm2/m256/m32bcst to eight packed signed quadword values in zmm1 subject to writemask k1.


    Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | Half | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

    Description

Converts eight packed single-precision floating-point values in the source operand to eight signed quadword integers in the destination operand.

When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register or the embedded rounding control bits. If a converted result cannot be represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value (2^(w-1), where w represents the number of bits in the destination format) is returned.

The source operand is a YMM/XMM/XMM (low 64 bits) register or a 256/128/64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.

Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.



    Operation

VCVTPS2QQ (EVEX encoded versions) when src operand is a register
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL == 512) AND (EVEX.b == 1)
    THEN
        SET_RM(EVEX.RC);
    ELSE
        SET_RM(MXCSR.RM);
FI;
FOR j ← 0 TO KL-1
    i ← j * 64
    k ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← Convert_Single_Precision_To_QuadInteger(SRC[k+31:k])
        ELSE
            IF *merging-masking*            ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE                        ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


VCVTPS2QQ (EVEX encoded versions) when src operand is a memory source
(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j ← 0 TO KL-1
    i ← j * 64
    k ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b == 1)
                THEN DEST[i+63:i] ← Convert_Single_Precision_To_QuadInteger(SRC[31:0])
                ELSE DEST[i+63:i] ← Convert_Single_Precision_To_QuadInteger(SRC[k+31:k])
            FI;
        ELSE
            IF *merging-masking*            ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE                        ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0



    Intel C/C++ Compiler Intrinsic Equivalent

VCVTPS2QQ __m512i _mm512_cvtps_epi64(__m256 a);
VCVTPS2QQ __m512i _mm512_mask_cvtps_epi64(__m512i s, __mmask8 k, __m256 a);
VCVTPS2QQ __m512i _mm512_maskz_cvtps_epi64(__mmask8 k, __m256 a);
VCVTPS2QQ __m512i _mm512_cvt_roundps_epi64(__m256 a, int r);
VCVTPS2QQ __m512i _mm512_mask_cvt_roundps_epi64(__m512i s, __mmask8 k, __m256 a, int r);
VCVTPS2QQ __m512i _mm512_maskz_cvt_roundps_epi64(__mmask8 k, __m256 a, int r);
VCVTPS2QQ __m256i _mm256_cvtps_epi64(__m128 a);
VCVTPS2QQ __m256i _mm256_mask_cvtps_epi64(__m256i s, __mmask8 k, __m128 a);
VCVTPS2QQ __m256i _mm256_maskz_cvtps_epi64(__mmask8 k, __m128 a);
VCVTPS2QQ __m128i _mm_cvtps_epi64(__m128 a);
VCVTPS2QQ __m128i _mm_mask_cvtps_epi64(__m128i s, __mmask8 k, __m128 a);
VCVTPS2QQ __m128i _mm_maskz_cvtps_epi64(__mmask8 k, __m128 a);


    SIMD Floating-Point Exceptions

    Invalid, Precision


    Other Exceptions

    EVEX-encoded instructions, see Exceptions Type E3

    #UD If EVEX.vvvv != 1111B.


    VCVTPS2UQQ—Convert Packed Single Precision Floating-Point Values to Packed Unsigned Quadword Integer Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F.W0 79 /r VCVTPS2UQQ xmm1 {k1}{z}, xmm2/m64/m32bcst | A | V/V | AVX512VL AVX512DQ | Convert two packed single precision floating-point values from xmm2/m64/m32bcst to two packed unsigned quadword values in xmm1 subject to writemask k1.
EVEX.256.66.0F.W0 79 /r VCVTPS2UQQ ymm1 {k1}{z}, xmm2/m128/m32bcst | A | V/V | AVX512VL AVX512DQ | Convert four packed single precision floating-point values from xmm2/m128/m32bcst to four packed unsigned quadword values in ymm1 subject to writemask k1.
EVEX.512.66.0F.W0 79 /r VCVTPS2UQQ zmm1 {k1}{z}, ymm2/m256/m32bcst{er} | A | V/V | AVX512DQ | Convert eight packed single precision floating-point values from ymm2/m256/m32bcst to eight packed unsigned quadword values in zmm1 subject to writemask k1.


    Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | Half | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

    Description

Converts up to eight packed single-precision floating-point values in the source operand to unsigned quadword integers in the destination operand.

When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register or the embedded rounding control bits. If a converted result cannot be represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked, the integer value 2^w – 1 is returned, where w represents the number of bits in the destination format.

The source operand is a YMM/XMM/XMM (low 64 bits) register or a 256/128/64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.

EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.



    Operation

VCVTPS2UQQ (EVEX encoded versions) when src operand is a register
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL == 512) AND (EVEX.b == 1)
    THEN
        SET_RM(EVEX.RC);
    ELSE
        SET_RM(MXCSR.RM);
FI;
FOR j ← 0 TO KL-1
    i ← j * 64
    k ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← Convert_Single_Precision_To_UQuadInteger(SRC[k+31:k])
        ELSE
            IF *merging-masking*            ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE                        ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


VCVTPS2UQQ (EVEX encoded versions) when src operand is a memory source
(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j ← 0 TO KL-1
    i ← j * 64
    k ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b == 1)
                THEN DEST[i+63:i] ← Convert_Single_Precision_To_UQuadInteger(SRC[31:0])
                ELSE DEST[i+63:i] ← Convert_Single_Precision_To_UQuadInteger(SRC[k+31:k])
            FI;
        ELSE
            IF *merging-masking*            ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE                        ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0



    Intel C/C++ Compiler Intrinsic Equivalent

VCVTPS2UQQ __m512i _mm512_cvtps_epu64(__m256 a);
VCVTPS2UQQ __m512i _mm512_mask_cvtps_epu64(__m512i s, __mmask8 k, __m256 a);
VCVTPS2UQQ __m512i _mm512_maskz_cvtps_epu64(__mmask8 k, __m256 a);
VCVTPS2UQQ __m512i _mm512_cvt_roundps_epu64(__m256 a, int r);
VCVTPS2UQQ __m512i _mm512_mask_cvt_roundps_epu64(__m512i s, __mmask8 k, __m256 a, int r);
VCVTPS2UQQ __m512i _mm512_maskz_cvt_roundps_epu64(__mmask8 k, __m256 a, int r);
VCVTPS2UQQ __m256i _mm256_cvtps_epu64(__m128 a);
VCVTPS2UQQ __m256i _mm256_mask_cvtps_epu64(__m256i s, __mmask8 k, __m128 a);
VCVTPS2UQQ __m256i _mm256_maskz_cvtps_epu64(__mmask8 k, __m128 a);
VCVTPS2UQQ __m128i _mm_cvtps_epu64(__m128 a);
VCVTPS2UQQ __m128i _mm_mask_cvtps_epu64(__m128i s, __mmask8 k, __m128 a);
VCVTPS2UQQ __m128i _mm_maskz_cvtps_epu64(__mmask8 k, __m128 a);


    SIMD Floating-Point Exceptions

    Invalid, Precision


    Other Exceptions

    EVEX-encoded instructions, see Exceptions Type E3

    #UD If EVEX.vvvv != 1111B.


    VCVTQQ2PD—Convert Packed Quadword Integers to Packed Double-Precision Floating-Point Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.F3.0F.W1 E6 /r VCVTQQ2PD xmm1 {k1}{z}, xmm2/m128/m64bcst | A | V/V | AVX512VL AVX512DQ | Convert two packed quadword integers from xmm2/m128/m64bcst to packed double-precision floating-point values in xmm1 with writemask k1.
EVEX.256.F3.0F.W1 E6 /r VCVTQQ2PD ymm1 {k1}{z}, ymm2/m256/m64bcst | A | V/V | AVX512VL AVX512DQ | Convert four packed quadword integers from ymm2/m256/m64bcst to packed double-precision floating-point values in ymm1 with writemask k1.
EVEX.512.F3.0F.W1 E6 /r VCVTQQ2PD zmm1 {k1}{z}, zmm2/m512/m64bcst{er} | A | V/V | AVX512DQ | Convert eight packed quadword integers from zmm2/m512/m64bcst to eight packed double-precision floating-point values in zmm1 with writemask k1.


    Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | Full | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

    Description

Converts packed quadword integers in the source operand (second operand) to packed double-precision floating-point values in the destination operand (first operand).

The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a 64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.

EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.


    Operation

VCVTQQ2PD (EVEX encoded versions) when src operand is a register
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL == 512) AND (EVEX.b == 1)
    THEN
        SET_RM(EVEX.RC);
    ELSE
        SET_RM(MXCSR.RM);
FI;

FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← Convert_QuadInteger_To_Double_Precision_Floating_Point(SRC[i+63:i])
        ELSE
            IF *merging-masking*            ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE                        ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0



VCVTQQ2PD (EVEX encoded versions) when src operand is a memory source
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b == 1)
                THEN DEST[i+63:i] ← Convert_QuadInteger_To_Double_Precision_Floating_Point(SRC[63:0])
                ELSE DEST[i+63:i] ← Convert_QuadInteger_To_Double_Precision_Floating_Point(SRC[i+63:i])
            FI;
        ELSE
            IF *merging-masking*            ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE                        ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


    Intel C/C++ Compiler Intrinsic Equivalent

VCVTQQ2PD __m512d _mm512_cvtepi64_pd(__m512i a);
VCVTQQ2PD __m512d _mm512_mask_cvtepi64_pd(__m512d s, __mmask8 k, __m512i a);
VCVTQQ2PD __m512d _mm512_maskz_cvtepi64_pd(__mmask8 k, __m512i a);
VCVTQQ2PD __m512d _mm512_cvt_roundepi64_pd(__m512i a, int r);
VCVTQQ2PD __m512d _mm512_mask_cvt_roundepi64_pd(__m512d s, __mmask8 k, __m512i a, int r);
VCVTQQ2PD __m512d _mm512_maskz_cvt_roundepi64_pd(__mmask8 k, __m512i a, int r);
VCVTQQ2PD __m256d _mm256_mask_cvtepi64_pd(__m256d s, __mmask8 k, __m256i a);
VCVTQQ2PD __m256d _mm256_maskz_cvtepi64_pd(__mmask8 k, __m256i a);
VCVTQQ2PD __m128d _mm_mask_cvtepi64_pd(__m128d s, __mmask8 k, __m128i a);
VCVTQQ2PD __m128d _mm_maskz_cvtepi64_pd(__mmask8 k, __m128i a);
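VCVTQQ2PD lists Precision as its only SIMD floating-point exception because int64-to-double conversion becomes inexact once the magnitude exceeds 2^53: a double has a 53-bit significand, so larger integers must be rounded. A plain-C illustration (the helper name is hypothetical; C's int64-to-double cast rounds under the current mode, by default nearest-even, just as the instruction does under MXCSR.RC = 00b):

```c
#include <assert.h>
#include <stdint.h>

/* cvt_qq2pd_element (hypothetical name): one lane of VCVTQQ2PD.
 * 2^53 is the last point at which consecutive int64 values are still
 * exactly representable as doubles; beyond it, conversion rounds (#P). */
static double cvt_qq2pd_element(int64_t q)
{
    return (double)q;   /* rounds per the current mode, default nearest-even */
}
```

For example, 2^53 and 2^53 + 1 convert to the same double, while 2^53 + 2 is again exactly representable.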


    SIMD Floating-Point Exceptions

    Precision


    Other Exceptions

    EVEX-encoded instructions, see Exceptions Type E2

    #UD If EVEX.vvvv != 1111B.


    VCVTQQ2PS—Convert Packed Quadword Integers to Packed Single-Precision Floating-Point Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.0F.W1 5B /r VCVTQQ2PS xmm1 {k1}{z}, xmm2/m128/m64bcst | A | V/V | AVX512VL AVX512DQ | Convert two packed quadword integers from xmm2/mem to packed single-precision floating-point values in xmm1 with writemask k1.
EVEX.256.0F.W1 5B /r VCVTQQ2PS xmm1 {k1}{z}, ymm2/m256/m64bcst | A | V/V | AVX512VL AVX512DQ | Convert four packed quadword integers from ymm2/mem to packed single-precision floating-point values in xmm1 with writemask k1.
EVEX.512.0F.W1 5B /r VCVTQQ2PS ymm1 {k1}{z}, zmm2/m512/m64bcst{er} | A | V/V | AVX512DQ | Convert eight packed quadword integers from zmm2/mem to eight packed single-precision floating-point values in ymm1 with writemask k1.


    Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | Full | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

    Description

Converts packed quadword integers in the source operand (second operand) to packed single-precision floating-point values in the destination operand (first operand).

The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a 64-bit memory location. The destination operand is a YMM/XMM/XMM (lower 64 bits) register conditionally updated with writemask k1.

EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.


    Operation

VCVTQQ2PS (EVEX encoded versions) when src operand is a register
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL == 512) AND (EVEX.b == 1)
    THEN
        SET_RM(EVEX.RC);
    ELSE
        SET_RM(MXCSR.RM);
FI;

FOR j ← 0 TO KL-1
    i ← j * 64
    k ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[k+31:k] ← Convert_QuadInteger_To_Single_Precision_Floating_Point(SRC[i+63:i])
        ELSE
            IF *merging-masking*            ; merging-masking
                THEN *DEST[k+31:k] remains unchanged*
                ELSE                        ; zeroing-masking
                    DEST[k+31:k] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL/2] ← 0



VCVTQQ2PS (EVEX encoded versions) when src operand is a memory source
(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j ← 0 TO KL-1
    i ← j * 64
    k ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b == 1)
                THEN DEST[k+31:k] ← Convert_QuadInteger_To_Single_Precision_Floating_Point(SRC[63:0])
                ELSE DEST[k+31:k] ← Convert_QuadInteger_To_Single_Precision_Floating_Point(SRC[i+63:i])
            FI;
        ELSE
            IF *merging-masking*            ; merging-masking
                THEN *DEST[k+31:k] remains unchanged*
                ELSE                        ; zeroing-masking
                    DEST[k+31:k] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL/2] ← 0


    Intel C/C++ Compiler Intrinsic Equivalent

VCVTQQ2PS __m256 _mm512_cvtepi64_ps(__m512i a);
VCVTQQ2PS __m256 _mm512_mask_cvtepi64_ps(__m256 s, __mmask8 k, __m512i a);
VCVTQQ2PS __m256 _mm512_maskz_cvtepi64_ps(__mmask8 k, __m512i a);
VCVTQQ2PS __m256 _mm512_cvt_roundepi64_ps(__m512i a, int r);
VCVTQQ2PS __m256 _mm512_mask_cvt_roundepi64_ps(__m256 s, __mmask8 k, __m512i a, int r);
VCVTQQ2PS __m256 _mm512_maskz_cvt_roundepi64_ps(__mmask8 k, __m512i a, int r);
VCVTQQ2PS __m128 _mm256_cvtepi64_ps(__m256i a);
VCVTQQ2PS __m128 _mm256_mask_cvtepi64_ps(__m128 s, __mmask8 k, __m256i a);
VCVTQQ2PS __m128 _mm256_maskz_cvtepi64_ps(__mmask8 k, __m256i a);
VCVTQQ2PS __m128 _mm_cvtepi64_ps(__m128i a);
VCVTQQ2PS __m128 _mm_mask_cvtepi64_ps(__m128 s, __mmask8 k, __m128i a);
VCVTQQ2PS __m128 _mm_maskz_cvtepi64_ps(__mmask8 k, __m128i a);


    SIMD Floating-Point Exceptions

    Precision


    Other Exceptions

    EVEX-encoded instructions, see Exceptions Type E2

    #UD If EVEX.vvvv != 1111B.


    VCVTSD2USI—Convert Scalar Double-Precision Floating-Point Value to Unsigned Doubleword Integer

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.LIG.F2.0F.W0 79 /r VCVTSD2USI r32, xmm1/m64{er} | A | V/V | AVX512F | Convert one double-precision floating-point value from xmm1/m64 to one unsigned doubleword integer r32.
EVEX.LIG.F2.0F.W1 79 /r VCVTSD2USI r64, xmm1/m64{er} | A | V/N.E.1 | AVX512F | Convert one double-precision floating-point value from xmm1/m64 to one unsigned quadword integer zero-extended into r64.

NOTES:
1. EVEX.W1 in non-64 bit is ignored; the instruction behaves as if the W0 version is used.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | Tuple1 Fixed | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Description

Converts a double-precision floating-point value in the source operand (the second operand) to an unsigned doubleword integer (or an unsigned quadword integer if the operand size is 64 bits) in the destination operand (the first operand). The source operand can be an XMM register or a 64-bit memory location. The destination operand is a general-purpose register. When the source operand is an XMM register, the double-precision floating-point value is contained in the low quadword of the register.

When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register or the embedded rounding control bits. If a converted result cannot be represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked, the integer value 2^w – 1 is returned, where w represents the number of bits in the destination format.


Operation

VCVTSD2USI (EVEX encoded version)
IF (SRC *is register*) AND (EVEX.b = 1)
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
IF 64-Bit Mode and OperandSize = 64
    THEN DEST[63:0] ← Convert_Double_Precision_Floating_Point_To_UInteger(SRC[63:0]);
    ELSE DEST[31:0] ← Convert_Double_Precision_Floating_Point_To_UInteger(SRC[63:0]);
FI


Intel C/C++ Compiler Intrinsic Equivalent

VCVTSD2USI unsigned int _mm_cvtsd_u32(__m128d);
VCVTSD2USI unsigned int _mm_cvt_roundsd_u32(__m128d, int r);
VCVTSD2USI unsigned __int64 _mm_cvtsd_u64(__m128d);
VCVTSD2USI unsigned __int64 _mm_cvt_roundsd_u64(__m128d, int r);


SIMD Floating-Point Exceptions

Invalid, Precision


Other Exceptions

EVEX-encoded instructions, see Exceptions Type E3NF.


VCVTSS2USI—Convert Scalar Single-Precision Floating-Point Value to Unsigned Doubleword Integer

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.LIG.F3.0F.W0 79 /r VCVTSS2USI r32, xmm1/m32{er} | A | V/V | AVX512F | Convert one single-precision floating-point value from xmm1/m32 to one unsigned doubleword integer in r32.
EVEX.LIG.F3.0F.W1 79 /r VCVTSS2USI r64, xmm1/m32{er} | A | V/N.E.1 | AVX512F | Convert one single-precision floating-point value from xmm1/m32 to one unsigned quadword integer in r64.

NOTES:

1. EVEX.W1 in non-64 bit is ignored; the instruction behaves as if the W0 version is used.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | Tuple1 Fixed | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Description

Converts a single-precision floating-point value in the source operand (the second operand) to an unsigned doubleword integer (or an unsigned quadword integer if the operand size is 64 bits) in the destination operand (the first operand). The source operand can be an XMM register or a memory location. The destination operand is a general-purpose register. When the source operand is an XMM register, the single-precision floating-point value is contained in the low doubleword of the register.

When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register or the embedded rounding control bits. If a converted result cannot be represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked, the integer value 2^w – 1 is returned, where w represents the number of bits in the destination format.

EVEX.W1 version: promotes the instruction to produce 64-bit data in 64-bit mode. Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Operation

VCVTSS2USI (EVEX encoded version)
IF (SRC *is register*) AND (EVEX.b = 1)
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
IF 64-bit Mode and OperandSize = 64
    THEN DEST[63:0] ← Convert_Single_Precision_Floating_Point_To_UInteger(SRC[31:0]);
    ELSE DEST[31:0] ← Convert_Single_Precision_Floating_Point_To_UInteger(SRC[31:0]);
FI;


Intel C/C++ Compiler Intrinsic Equivalent

VCVTSS2USI unsigned _mm_cvtss_u32(__m128 a);
VCVTSS2USI unsigned _mm_cvt_roundss_u32(__m128 a, int r);
VCVTSS2USI unsigned __int64 _mm_cvtss_u64(__m128 a);
VCVTSS2USI unsigned __int64 _mm_cvt_roundss_u64(__m128 a, int r);



SIMD Floating-Point Exceptions

Invalid, Precision


Other Exceptions

EVEX-encoded instructions, see Exceptions Type E3NF.


VCVTTPD2QQ—Convert with Truncation Packed Double-Precision Floating-Point Values to Packed Quadword Integers

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F.W1 7A /r VCVTTPD2QQ xmm1 {k1}{z}, xmm2/m128/m64bcst | A | V/V | AVX512VL AVX512DQ | Convert two packed double-precision floating-point values from xmm2/m128/m64bcst to two packed quadword integers in xmm1 using truncation with writemask k1.
EVEX.256.66.0F.W1 7A /r VCVTTPD2QQ ymm1 {k1}{z}, ymm2/m256/m64bcst | A | V/V | AVX512VL AVX512DQ | Convert four packed double-precision floating-point values from ymm2/m256/m64bcst to four packed quadword integers in ymm1 using truncation with writemask k1.
EVEX.512.66.0F.W1 7A /r VCVTTPD2QQ zmm1 {k1}{z}, zmm2/m512/m64bcst{sae} | A | V/V | AVX512DQ | Convert eight packed double-precision floating-point values from zmm2/m512/m64bcst to eight packed quadword integers in zmm1 using truncation with writemask k1.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | Full | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Description

Converts with truncation packed double-precision floating-point values in the source operand (second operand) to packed quadword integers in the destination operand (first operand).

EVEX encoded versions: The source operand is a ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.

When a conversion is inexact, a truncated (round toward zero) value is returned. If a converted result cannot be represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value (2^(w-1), where w represents the number of bits in the destination format) is returned.

Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.


Operation

VCVTTPD2QQ (EVEX encoded version) when src operand is a register
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← Convert_Double_Precision_Floating_Point_To_QuadInteger_Truncate(SRC[i+63:i])
        ELSE
            IF *merging-masking*            ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE                        ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0



VCVTTPD2QQ (EVEX encoded version) when src operand is a memory source
(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b == 1)
                THEN DEST[i+63:i] ← Convert_Double_Precision_Floating_Point_To_QuadInteger_Truncate(SRC[63:0])
                ELSE DEST[i+63:i] ← Convert_Double_Precision_Floating_Point_To_QuadInteger_Truncate(SRC[i+63:i])
            FI;
        ELSE
            IF *merging-masking*            ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE                        ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


Intel C/C++ Compiler Intrinsic Equivalent

VCVTTPD2QQ __m512i _mm512_cvttpd_epi64(__m512d a);
VCVTTPD2QQ __m512i _mm512_mask_cvttpd_epi64(__m512i s, __mmask8 k, __m512d a);
VCVTTPD2QQ __m512i _mm512_maskz_cvttpd_epi64(__mmask8 k, __m512d a);
VCVTTPD2QQ __m512i _mm512_cvtt_roundpd_epi64(__m512d a, int sae);
VCVTTPD2QQ __m512i _mm512_mask_cvtt_roundpd_epi64(__m512i s, __mmask8 k, __m512d a, int sae);
VCVTTPD2QQ __m512i _mm512_maskz_cvtt_roundpd_epi64(__mmask8 k, __m512d a, int sae);
VCVTTPD2QQ __m256i _mm256_mask_cvttpd_epi64(__m256i s, __mmask8 k, __m256d a);
VCVTTPD2QQ __m256i _mm256_maskz_cvttpd_epi64(__mmask8 k, __m256d a);
VCVTTPD2QQ __m128i _mm_mask_cvttpd_epi64(__m128i s, __mmask8 k, __m128d a);
VCVTTPD2QQ __m128i _mm_maskz_cvttpd_epi64(__mmask8 k, __m128d a);


SIMD Floating-Point Exceptions

Invalid, Precision


Other Exceptions

EVEX-encoded instructions, see Exceptions Type E2.

#UD If EVEX.vvvv != 1111B.


VCVTTPD2UDQ—Convert with Truncation Packed Double-Precision Floating-Point Values to Packed Unsigned Doubleword Integers

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.0F.W1 78 /r VCVTTPD2UDQ xmm1 {k1}{z}, xmm2/m128/m64bcst | A | V/V | AVX512VL AVX512F | Convert two packed double-precision floating-point values in xmm2/m128/m64bcst to two unsigned doubleword integers in xmm1 using truncation subject to writemask k1.
EVEX.256.0F.W1 78 /r VCVTTPD2UDQ xmm1 {k1}{z}, ymm2/m256/m64bcst | A | V/V | AVX512VL AVX512F | Convert four packed double-precision floating-point values in ymm2/m256/m64bcst to four unsigned doubleword integers in xmm1 using truncation subject to writemask k1.
EVEX.512.0F.W1 78 /r VCVTTPD2UDQ ymm1 {k1}{z}, zmm2/m512/m64bcst{sae} | A | V/V | AVX512F | Convert eight packed double-precision floating-point values in zmm2/m512/m64bcst to eight unsigned doubleword integers in ymm1 using truncation subject to writemask k1.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | Full | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Description

Converts with truncation packed double-precision floating-point values in the source operand (the second operand) to packed unsigned doubleword integers in the destination operand (the first operand).

When a conversion is inexact, a truncated (round toward zero) result is returned. If a converted result cannot be represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked, the integer value 2^w – 1 is returned, where w represents the number of bits in the destination format.

The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a 64-bit memory location. The destination operand is a YMM/XMM/XMM (low 64 bits) register conditionally updated with writemask k1. The upper bits (MAXVL-1:256) of the corresponding destination are zeroed.

Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.



Operation

VCVTTPD2UDQ (EVEX encoded versions) when src2 operand is a register

(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    k ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← Convert_Double_Precision_Floating_Point_To_UInteger_Truncate(SRC[k+63:k])
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL/2] ← 0


VCVTTPD2UDQ (EVEX encoded versions) when src operand is a memory source

(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    k ← j * 64
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1)
                THEN DEST[i+31:i] ← Convert_Double_Precision_Floating_Point_To_UInteger_Truncate(SRC[63:0])
                ELSE DEST[i+31:i] ← Convert_Double_Precision_Floating_Point_To_UInteger_Truncate(SRC[k+63:k])
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL/2] ← 0



Intel C/C++ Compiler Intrinsic Equivalent

VCVTTPD2UDQ __m256i _mm512_cvttpd_epu32( __m512d a);
VCVTTPD2UDQ __m256i _mm512_mask_cvttpd_epu32( __m256i s, __mmask8 k, __m512d a);
VCVTTPD2UDQ __m256i _mm512_maskz_cvttpd_epu32( __mmask8 k, __m512d a);
VCVTTPD2UDQ __m256i _mm512_cvtt_roundpd_epu32( __m512d a, int sae);
VCVTTPD2UDQ __m256i _mm512_mask_cvtt_roundpd_epu32( __m256i s, __mmask8 k, __m512d a, int sae);
VCVTTPD2UDQ __m256i _mm512_maskz_cvtt_roundpd_epu32( __mmask8 k, __m512d a, int sae);
VCVTTPD2UDQ __m128i _mm256_mask_cvttpd_epu32( __m128i s, __mmask8 k, __m256d a);
VCVTTPD2UDQ __m128i _mm256_maskz_cvttpd_epu32( __mmask8 k, __m256d a);
VCVTTPD2UDQ __m128i _mm_mask_cvttpd_epu32( __m128i s, __mmask8 k, __m128d a);
VCVTTPD2UDQ __m128i _mm_maskz_cvttpd_epu32( __mmask8 k, __m128d a);


SIMD Floating-Point Exceptions

Invalid, Precision


Other Exceptions

EVEX-encoded instructions, see Exceptions Type E2.

#UD If EVEX.vvvv != 1111B.


VCVTTPD2UQQ—Convert with Truncation Packed Double-Precision Floating-Point Values to Packed Unsigned Quadword Integers

Opcode/ Instruction

Op / En

64/32

bit Mode Support

CPUID

Feature Flag

Description

EVEX.128.66.0F.W1 78 /r VCVTTPD2UQQ xmm1 {k1}{z},

xmm2/m128/m64bcst

A

V/V

AVX512VL AVX512DQ

Convert two packed double-precision floating-point values from xmm2/m128/m64bcst to two packed unsigned quadword integers in xmm1 using truncation with writemask k1.

EVEX.256.66.0F.W1 78 /r VCVTTPD2UQQ ymm1 {k1}{z},

ymm2/m256/m64bcst

A

V/V

AVX512VL AVX512DQ

Convert four packed double-precision floating-point values from ymm2/m256/m64bcst to four packed unsigned quadword integers in ymm1 using truncation with writemask k1.

EVEX.512.66.0F.W1 78 /r VCVTTPD2UQQ zmm1 {k1}{z},

zmm2/m512/m64bcst{sae}

A

V/V

AVX512DQ

Convert eight packed double-precision floating-point values from zmm2/m512/m64bcst to eight packed unsigned quadword integers in zmm1 using truncation with writemask k1.


Instruction Operand Encoding

Op/En

Tuple Type

Operand 1

Operand 2

Operand 3

Operand 4

A

Full

ModRM:reg (w)

ModRM:r/m (r)

NA

NA

Description

Converts with truncation packed double-precision floating-point values in the source operand (second operand) to packed unsigned quadword integers in the destination operand (first operand).

When a conversion is inexact, a truncated (round toward zero) result is returned. If a converted result cannot be represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked, the integer value 2^w – 1 is returned, where w represents the number of bits in the destination format.

EVEX encoded versions: The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a 64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.

Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.


Operation

VCVTTPD2UQQ (EVEX encoded versions) when src operand is a register

(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← Convert_Double_Precision_Floating_Point_To_UQuadInteger_Truncate(SRC[i+63:i])
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0



VCVTTPD2UQQ (EVEX encoded versions) when src operand is a memory source

(KL, VL) = (2, 128), (4, 256), (8, 512)


FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1)
                THEN DEST[i+63:i] ← Convert_Double_Precision_Floating_Point_To_UQuadInteger_Truncate(SRC[63:0])
                ELSE DEST[i+63:i] ← Convert_Double_Precision_Floating_Point_To_UQuadInteger_Truncate(SRC[i+63:i])
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


Intel C/C++ Compiler Intrinsic Equivalent

VCVTTPD2UQQ _mm<size>[_mask[z]]_cvtt[_round]pd_epu64
VCVTTPD2UQQ __m512i _mm512_cvttpd_epu64( __m512d a);
VCVTTPD2UQQ __m512i _mm512_mask_cvttpd_epu64( __m512i s, __mmask8 k, __m512d a);
VCVTTPD2UQQ __m512i _mm512_maskz_cvttpd_epu64( __mmask8 k, __m512d a);
VCVTTPD2UQQ __m512i _mm512_cvtt_roundpd_epu64( __m512d a, int sae);
VCVTTPD2UQQ __m512i _mm512_mask_cvtt_roundpd_epu64( __m512i s, __mmask8 k, __m512d a, int sae);
VCVTTPD2UQQ __m512i _mm512_maskz_cvtt_roundpd_epu64( __mmask8 k, __m512d a, int sae);
VCVTTPD2UQQ __m256i _mm256_mask_cvttpd_epu64( __m256i s, __mmask8 k, __m256d a);
VCVTTPD2UQQ __m256i _mm256_maskz_cvttpd_epu64( __mmask8 k, __m256d a);
VCVTTPD2UQQ __m128i _mm_mask_cvttpd_epu64( __m128i s, __mmask8 k, __m128d a);
VCVTTPD2UQQ __m128i _mm_maskz_cvttpd_epu64( __mmask8 k, __m128d a);


SIMD Floating-Point Exceptions

Invalid, Precision


Other Exceptions

EVEX-encoded instructions, see Exceptions Type E2.

#UD If EVEX.vvvv != 1111B.


VCVTTPS2UDQ—Convert with Truncation Packed Single-Precision Floating-Point Values to Packed Unsigned Doubleword Integer Values

Opcode/ Instruction

Op / En

64/32

bit Mode Support

CPUID

Feature Flag

Description

EVEX.128.0F.W0 78 /r VCVTTPS2UDQ xmm1 {k1}{z},

xmm2/m128/m32bcst

A

V/V

AVX512VL AVX512F

Convert four packed single precision floating-point values from xmm2/m128/m32bcst to four packed unsigned doubleword values in xmm1 using truncation subject to writemask k1.

EVEX.256.0F.W0 78 /r VCVTTPS2UDQ ymm1 {k1}{z},

ymm2/m256/m32bcst

A

V/V

AVX512VL AVX512F

Convert eight packed single precision floating-point values from ymm2/m256/m32bcst to eight packed unsigned doubleword values in ymm1 using truncation subject to writemask k1.

EVEX.512.0F.W0 78 /r VCVTTPS2UDQ zmm1 {k1}{z},

zmm2/m512/m32bcst{sae}

A

V/V

AVX512F

Convert sixteen packed single-precision floating-point values from zmm2/m512/m32bcst to sixteen packed unsigned doubleword values in zmm1 using truncation subject to writemask k1.


Instruction Operand Encoding

Op/En

Tuple Type

Operand 1

Operand 2

Operand 3

Operand 4

A

Full

ModRM:reg (w)

ModRM:r/m (r)

NA

NA

Description

Converts with truncation packed single-precision floating-point values in the source operand to sixteen unsigned doubleword integers in the destination operand.

When a conversion is inexact, a truncated (round toward zero) result is returned. If a converted result cannot be represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked, the integer value 2^w – 1 is returned, where w represents the number of bits in the destination format.

EVEX encoded versions: The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 32-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.

Note: EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.


Operation

VCVTTPS2UDQ (EVEX encoded versions) when src operand is a register

(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← Convert_Single_Precision_Floating_Point_To_UInteger_Truncate(SRC[i+31:i])
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0



VCVTTPS2UDQ (EVEX encoded versions) when src operand is a memory source

(KL, VL) = (4, 128), (8, 256), (16, 512)


FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1)
                THEN DEST[i+31:i] ← Convert_Single_Precision_Floating_Point_To_UInteger_Truncate(SRC[31:0])
                ELSE DEST[i+31:i] ← Convert_Single_Precision_Floating_Point_To_UInteger_Truncate(SRC[i+31:i])
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


Intel C/C++ Compiler Intrinsic Equivalent

VCVTTPS2UDQ __m512i _mm512_cvttps_epu32( __m512 a);
VCVTTPS2UDQ __m512i _mm512_mask_cvttps_epu32( __m512i s, __mmask16 k, __m512 a);
VCVTTPS2UDQ __m512i _mm512_maskz_cvttps_epu32( __mmask16 k, __m512 a);
VCVTTPS2UDQ __m512i _mm512_cvtt_roundps_epu32( __m512 a, int sae);
VCVTTPS2UDQ __m512i _mm512_mask_cvtt_roundps_epu32( __m512i s, __mmask16 k, __m512 a, int sae);
VCVTTPS2UDQ __m512i _mm512_maskz_cvtt_roundps_epu32( __mmask16 k, __m512 a, int sae);
VCVTTPS2UDQ __m256i _mm256_mask_cvttps_epu32( __m256i s, __mmask8 k, __m256 a);
VCVTTPS2UDQ __m256i _mm256_maskz_cvttps_epu32( __mmask8 k, __m256 a);
VCVTTPS2UDQ __m128i _mm_mask_cvttps_epu32( __m128i s, __mmask8 k, __m128 a);
VCVTTPS2UDQ __m128i _mm_maskz_cvttps_epu32( __mmask8 k, __m128 a);


SIMD Floating-Point Exceptions

Invalid, Precision


Other Exceptions

EVEX-encoded instructions, see Exceptions Type E2.

#UD If EVEX.vvvv != 1111B.


VCVTTPS2QQ—Convert with Truncation Packed Single Precision Floating-Point Values to Packed Signed Quadword Integer Values

Opcode/ Instruction

Op / En

64/32

bit Mode Support

CPUID

Feature Flag

Description

EVEX.128.66.0F.W0 7A /r VCVTTPS2QQ xmm1 {k1}{z},

xmm2/m64/m32bcst

A

V/V

AVX512VL AVX512DQ

Convert two packed single precision floating-point values from xmm2/m64/m32bcst to two packed signed quadword values in xmm1 using truncation subject to writemask k1.

EVEX.256.66.0F.W0 7A /r VCVTTPS2QQ ymm1 {k1}{z},

xmm2/m128/m32bcst

A

V/V

AVX512VL AVX512DQ

Convert four packed single precision floating-point values from xmm2/m128/m32bcst to four packed signed quadword values in ymm1 using truncation subject to writemask k1.

EVEX.512.66.0F.W0 7A /r VCVTTPS2QQ zmm1 {k1}{z},

ymm2/m256/m32bcst{sae}

A

V/V

AVX512DQ

Convert eight packed single precision floating-point values from ymm2/m256/m32bcst to eight packed signed quadword values in zmm1 using truncation subject to writemask k1.


Instruction Operand Encoding

Op/En

Tuple Type

Operand 1

Operand 2

Operand 3

Operand 4

A

Half

ModRM:reg (w)

ModRM:r/m (r)

NA

NA

Description

Converts with truncation packed single-precision floating-point values in the source operand to eight signed quadword integers in the destination operand.

When a conversion is inexact, a truncated (round toward zero) result is returned. If a converted result cannot be represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value (2^(w-1), where w represents the number of bits in the destination format) is returned.

EVEX encoded versions: The source operand is a YMM/XMM/XMM (low 64 bits) register, a 256/128/64-bit memory location, or a 256/128/64-bit vector broadcasted from a 32-bit memory location. The destination operand is a vector register conditionally updated with writemask k1.

Note: EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.


Operation

VCVTTPS2QQ (EVEX encoded versions) when src operand is a register

(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j ← 0 TO KL-1
    i ← j * 64
    k ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← Convert_Single_Precision_To_QuadInteger_Truncate(SRC[k+31:k])
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0



VCVTTPS2QQ (EVEX encoded versions) when src operand is a memory source

(KL, VL) = (2, 128), (4, 256), (8, 512)


FOR j ← 0 TO KL-1
    i ← j * 64
    k ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1)
                THEN DEST[i+63:i] ← Convert_Single_Precision_To_QuadInteger_Truncate(SRC[31:0])
                ELSE DEST[i+63:i] ← Convert_Single_Precision_To_QuadInteger_Truncate(SRC[k+31:k])
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


Intel C/C++ Compiler Intrinsic Equivalent

VCVTTPS2QQ __m512i _mm512_cvttps_epi64( __m256 a);
VCVTTPS2QQ __m512i _mm512_mask_cvttps_epi64( __m512i s, __mmask8 k, __m256 a);
VCVTTPS2QQ __m512i _mm512_maskz_cvttps_epi64( __mmask8 k, __m256 a);
VCVTTPS2QQ __m512i _mm512_cvtt_roundps_epi64( __m256 a, int sae);
VCVTTPS2QQ __m512i _mm512_mask_cvtt_roundps_epi64( __m512i s, __mmask8 k, __m256 a, int sae);
VCVTTPS2QQ __m512i _mm512_maskz_cvtt_roundps_epi64( __mmask8 k, __m256 a, int sae);
VCVTTPS2QQ __m256i _mm256_mask_cvttps_epi64( __m256i s, __mmask8 k, __m128 a);
VCVTTPS2QQ __m256i _mm256_maskz_cvttps_epi64( __mmask8 k, __m128 a);
VCVTTPS2QQ __m128i _mm_mask_cvttps_epi64( __m128i s, __mmask8 k, __m128 a);
VCVTTPS2QQ __m128i _mm_maskz_cvttps_epi64( __mmask8 k, __m128 a);


SIMD Floating-Point Exceptions

Invalid, Precision


Other Exceptions

EVEX-encoded instructions, see Exceptions Type E3.

#UD If EVEX.vvvv != 1111B.


VCVTTPS2UQQ—Convert with Truncation Packed Single Precision Floating-Point Values to Packed Unsigned Quadword Integer Values

Opcode/ Instruction

Op / En

64/32

bit Mode Support

CPUID

Feature Flag

Description

EVEX.128.66.0F.W0 78 /r VCVTTPS2UQQ xmm1 {k1}{z},

xmm2/m64/m32bcst

A

V/V

AVX512VL AVX512DQ

Convert two packed single precision floating-point values from xmm2/m64/m32bcst to two packed unsigned quadword values in xmm1 using truncation subject to writemask k1.

EVEX.256.66.0F.W0 78 /r VCVTTPS2UQQ ymm1 {k1}{z},

xmm2/m128/m32bcst

A

V/V

AVX512VL AVX512DQ

Convert four packed single precision floating-point values from xmm2/m128/m32bcst to four packed unsigned quadword values in ymm1 using truncation subject to writemask k1.

EVEX.512.66.0F.W0 78 /r VCVTTPS2UQQ zmm1 {k1}{z},

ymm2/m256/m32bcst{sae}

A

V/V

AVX512DQ

Convert eight packed single precision floating-point values from ymm2/m256/m32bcst to eight packed unsigned quadword values in zmm1 using truncation subject to writemask k1.


Instruction Operand Encoding

Op/En

Tuple Type

Operand 1

Operand 2

Operand 3

Operand 4

A

Half

ModRM:reg (w)

ModRM:r/m (r)

NA

NA

Description

Converts with truncation up to eight packed single-precision floating-point values in the source operand to unsigned quadword integers in the destination operand.

When a conversion is inexact, a truncated (round toward zero) result is returned. If a converted result cannot be represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked, the integer value 2^w – 1 is returned, where w represents the number of bits in the destination format.

EVEX encoded versions: The source operand is a YMM/XMM/XMM (low 64 bits) register, a 256/128/64-bit memory location, or a 256/128/64-bit vector broadcasted from a 32-bit memory location. The destination operand is a vector register conditionally updated with writemask k1.

Note: EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.


Operation

VCVTTPS2UQQ (EVEX encoded versions) when src operand is a register

(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j ← 0 TO KL-1
    i ← j * 64
    k ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← Convert_Single_Precision_To_UQuadInteger_Truncate(SRC[k+31:k])
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0



VCVTTPS2UQQ (EVEX encoded versions) when src operand is a memory source

(KL, VL) = (2, 128), (4, 256), (8, 512)


FOR j ← 0 TO KL-1
    i ← j * 64
    k ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1)
                THEN DEST[i+63:i] ← Convert_Single_Precision_To_UQuadInteger_Truncate(SRC[31:0])
                ELSE DEST[i+63:i] ← Convert_Single_Precision_To_UQuadInteger_Truncate(SRC[k+31:k])
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


Intel C/C++ Compiler Intrinsic Equivalent

VCVTTPS2UQQ _mm<size>[_mask[z]]_cvtt[_round]ps_epu64
VCVTTPS2UQQ __m512i _mm512_cvttps_epu64( __m256 a);
VCVTTPS2UQQ __m512i _mm512_mask_cvttps_epu64( __m512i s, __mmask8 k, __m256 a);
VCVTTPS2UQQ __m512i _mm512_maskz_cvttps_epu64( __mmask8 k, __m256 a);
VCVTTPS2UQQ __m512i _mm512_cvtt_roundps_epu64( __m256 a, int sae);
VCVTTPS2UQQ __m512i _mm512_mask_cvtt_roundps_epu64( __m512i s, __mmask8 k, __m256 a, int sae);
VCVTTPS2UQQ __m512i _mm512_maskz_cvtt_roundps_epu64( __mmask8 k, __m256 a, int sae);
VCVTTPS2UQQ __m256i _mm256_mask_cvttps_epu64( __m256i s, __mmask8 k, __m128 a);
VCVTTPS2UQQ __m256i _mm256_maskz_cvttps_epu64( __mmask8 k, __m128 a);
VCVTTPS2UQQ __m128i _mm_mask_cvttps_epu64( __m128i s, __mmask8 k, __m128 a);
VCVTTPS2UQQ __m128i _mm_maskz_cvttps_epu64( __mmask8 k, __m128 a);


SIMD Floating-Point Exceptions

Invalid, Precision


Other Exceptions

EVEX-encoded instructions, see Exceptions Type E3.

#UD If EVEX.vvvv != 1111B.


VCVTTSD2USI—Convert with Truncation Scalar Double-Precision Floating-Point Value to Unsigned Integer

Opcode/ Instruction

Op / En

64/32

bit Mode Support

CPUID

Feature Flag

Description

EVEX.LIG.F2.0F.W0 78 /r

VCVTTSD2USI r32, xmm1/m64{sae}

A

V/V

AVX512F

Convert one double-precision floating-point value from xmm1/m64 to one unsigned doubleword integer r32 using truncation.

EVEX.LIG.F2.0F.W1 78 /r

VCVTTSD2USI r64, xmm1/m64{sae}

A

V/N.E.1

AVX512F

Convert one double-precision floating-point value from xmm1/m64 to one unsigned quadword integer zero- extended into r64 using truncation.

NOTES:

1. For this specific instruction, EVEX.W in non-64-bit mode is ignored; the instruction behaves as if the W0 version is used.


Instruction Operand Encoding

Op/En

Tuple Type

Operand 1

Operand 2

Operand 3

Operand 4

A

Tuple1 Fixed

ModRM:reg (w)

ModRM:r/m (r)

NA

NA

Description

Converts with truncation a double-precision floating-point value in the source operand (the second operand) to an unsigned doubleword integer (or unsigned quadword integer if operand size is 64 bits) in the destination operand (the first operand). The source operand can be an XMM register or a 64-bit memory location. The destination operand is a general-purpose register. When the source operand is an XMM register, the double-precision floating- point value is contained in the low quadword of the register.

When a conversion is inexact, a truncated (round toward zero) result is returned. If a converted result cannot be represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked, the integer value 2^w – 1 is returned, where w represents the number of bits in the destination format.

EVEX.W1 version: promotes the instruction to produce 64-bit data in 64-bit mode.


Operation

VCVTTSD2USI (EVEX encoded version)

IF 64-Bit Mode and OperandSize = 64
    THEN DEST[63:0] ← Convert_Double_Precision_Floating_Point_To_UInteger_Truncate(SRC[63:0]);
    ELSE DEST[31:0] ← Convert_Double_Precision_Floating_Point_To_UInteger_Truncate(SRC[63:0]);
FI


Intel C/C++ Compiler Intrinsic Equivalent

VCVTTSD2USI unsigned int _mm_cvttsd_u32( __m128d);
VCVTTSD2USI unsigned int _mm_cvtt_roundsd_u32( __m128d, int sae);
VCVTTSD2USI unsigned __int64 _mm_cvttsd_u64( __m128d);
VCVTTSD2USI unsigned __int64 _mm_cvtt_roundsd_u64( __m128d, int sae);


SIMD Floating-Point Exceptions

Invalid, Precision


Other Exceptions

EVEX-encoded instructions, see Exceptions Type E3NF.


VCVTTSS2USI—Convert with Truncation Scalar Single-Precision Floating-Point Value to Unsigned Integer

Opcode/ Instruction

Op / En

64/32

bit Mode Support

CPUID

Feature Flag

Description

EVEX.LIG.F3.0F.W0 78 /r

VCVTTSS2USI r32, xmm1/m32{sae}

A

V/V

AVX512F

Convert one single-precision floating-point value from xmm1/m32 to one unsigned doubleword integer in r32 using truncation.

EVEX.LIG.F3.0F.W1 78 /r

VCVTTSS2USI r64, xmm1/m32{sae}

A

V/N.E.1

AVX512F

Convert one single-precision floating-point value from xmm1/m32 to one unsigned quadword integer in r64 using truncation.

NOTES:

1. For this specific instruction, EVEX.W in non-64-bit mode is ignored; the instruction behaves as if the W0 version is used.


Instruction Operand Encoding

Op/En

Tuple Type

Operand 1

Operand 2

Operand 3

Operand 4

A

Tuple1 Fixed

ModRM:reg (w)

ModRM:r/m (r)

NA

NA

Description

Converts with truncation a single-precision floating-point value in the source operand (the second operand) to an unsigned doubleword integer (or unsigned quadword integer if operand size is 64 bits) in the destination operand (the first operand). The source operand can be an XMM register or a memory location. The destination operand is a general-purpose register. When the source operand is an XMM register, the single-precision floating-point value is contained in the low doubleword of the register.

When a conversion is inexact, a truncated (round toward zero) result is returned. If a converted result cannot be represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked, the integer value 2^w – 1 is returned, where w represents the number of bits in the destination format.

EVEX.W1 version: promotes the instruction to produce 64-bit data in 64-bit mode. Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.



Operation

VCVTTSS2USI (EVEX encoded version)
IF 64-bit Mode and OperandSize = 64
    THEN DEST[63:0] ← Convert_Single_Precision_Floating_Point_To_UInteger_Truncate(SRC[31:0]);
    ELSE DEST[31:0] ← Convert_Single_Precision_Floating_Point_To_UInteger_Truncate(SRC[31:0]);
FI;


Intel C/C++ Compiler Intrinsic Equivalent

VCVTTSS2USI unsigned int _mm_cvttss_u32( __m128 a);
VCVTTSS2USI unsigned int _mm_cvtt_roundss_u32( __m128 a, int sae);
VCVTTSS2USI unsigned __int64 _mm_cvttss_u64( __m128 a);
VCVTTSS2USI unsigned __int64 _mm_cvtt_roundss_u64( __m128 a, int sae);


SIMD Floating-Point Exceptions

Invalid, Precision


Other Exceptions

EVEX-encoded instructions, see Exceptions Type E3NF.


VCVTUDQ2PD—Convert Packed Unsigned Doubleword Integers to Packed Double-Precision Floating-Point Values

Opcode/ Instruction

Op / En

64/32

bit Mode Support

CPUID

Feature Flag

Description

EVEX.128.F3.0F.W0 7A /r VCVTUDQ2PD xmm1 {k1}{z},

xmm2/m64/m32bcst

A

V/V

AVX512VL AVX512F

Convert two packed unsigned doubleword integers from xmm2/m64/m32bcst to packed double-precision floating-point values in xmm1 with writemask k1.

EVEX.256.F3.0F.W0 7A /r VCVTUDQ2PD ymm1 {k1}{z},

xmm2/m128/m32bcst

A

V/V

AVX512VL AVX512F

Convert four packed unsigned doubleword integers from xmm2/m128/m32bcst to packed double-precision floating-point values in ymm1 with writemask k1.

EVEX.512.F3.0F.W0 7A /r VCVTUDQ2PD zmm1 {k1}{z},

ymm2/m256/m32bcst

A

V/V

AVX512F

Convert eight packed unsigned doubleword integers from ymm2/m256/m32bcst to eight packed double-precision floating-point values in zmm1 with writemask k1.


Instruction Operand Encoding

Op/En

Tuple Type

Operand 1

Operand 2

Operand 3

Operand 4

A

Half

ModRM:reg (w)

ModRM:r/m (r)

NA

NA

Description

Converts packed unsigned doubleword integers in the source operand (second operand) to packed double-precision floating-point values in the destination operand (first operand).

The source operand is a YMM/XMM/XMM (low 64 bits) register, a 256/128/64-bit memory location or a 256/128/64-bit vector broadcasted from a 32-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.

An attempt to encode this instruction with EVEX embedded rounding is ignored. Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Operation

VCVTUDQ2PD (EVEX encoded versions) when src operand is a register

(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j ← 0 TO KL-1
    i ← j * 64
    k ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← Convert_UInteger_To_Double_Precision_Floating_Point(SRC[k+31:k])
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0



VCVTUDQ2PD (EVEX encoded versions) when src operand is a memory source

(KL, VL) = (2, 128), (4, 256), (8, 512)


FOR j ← 0 TO KL-1
    i ← j * 64
    k ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1)
                THEN DEST[i+63:i] ← Convert_UInteger_To_Double_Precision_Floating_Point(SRC[31:0])
                ELSE DEST[i+63:i] ← Convert_UInteger_To_Double_Precision_Floating_Point(SRC[k+31:k])
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


Intel C/C++ Compiler Intrinsic Equivalent

VCVTUDQ2PD __m512d _mm512_cvtepu32_pd( __m256i a);
VCVTUDQ2PD __m512d _mm512_mask_cvtepu32_pd( __m512d s, __mmask8 k, __m256i a);
VCVTUDQ2PD __m512d _mm512_maskz_cvtepu32_pd( __mmask8 k, __m256i a);
VCVTUDQ2PD __m256d _mm256_cvtepu32_pd( __m128i a);
VCVTUDQ2PD __m256d _mm256_mask_cvtepu32_pd( __m256d s, __mmask8 k, __m128i a);
VCVTUDQ2PD __m256d _mm256_maskz_cvtepu32_pd( __mmask8 k, __m128i a);
VCVTUDQ2PD __m128d _mm_cvtepu32_pd( __m128i a);
VCVTUDQ2PD __m128d _mm_mask_cvtepu32_pd( __m128d s, __mmask8 k, __m128i a);
VCVTUDQ2PD __m128d _mm_maskz_cvtepu32_pd( __mmask8 k, __m128i a);


SIMD Floating-Point Exceptions

None


Other Exceptions

EVEX-encoded instructions, see Exceptions Type E5.

#UD If EVEX.vvvv != 1111B.


VCVTUDQ2PS—Convert Packed Unsigned Doubleword Integers to Packed Single-Precision Floating-Point Values

Opcode/ Instruction

Op / En

64/32

bit Mode Support

CPUID

Feature Flag

Description

EVEX.128.F2.0F.W0 7A /r VCVTUDQ2PS xmm1 {k1}{z},

xmm2/m128/m32bcst

A

V/V

AVX512VL AVX512F

Convert four packed unsigned doubleword integers from xmm2/m128/m32bcst to packed single-precision floating-point values in xmm1 with writemask k1.

EVEX.256.F2.0F.W0 7A /r VCVTUDQ2PS ymm1 {k1}{z},

ymm2/m256/m32bcst

A

V/V

AVX512VL AVX512F

Convert eight packed unsigned doubleword integers from ymm2/m256/m32bcst to packed single-precision floating-point values in ymm1 with writemask k1.

EVEX.512.F2.0F.W0 7A /r VCVTUDQ2PS zmm1 {k1}{z},

zmm2/m512/m32bcst{er}

A

V/V

AVX512F

Convert sixteen packed unsigned doubleword integers from zmm2/m512/m32bcst to sixteen packed single-precision floating-point values in zmm1 with writemask k1.


Instruction Operand Encoding

Op/En

Tuple Type

Operand 1

Operand 2

Operand 3

Operand 4

A

Full

ModRM:reg (w)

ModRM:r/m (r)

NA

NA

Description

Converts packed unsigned doubleword integers in the source operand (second operand) to single-precision floating-point values in the destination operand (first operand).

The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 32-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.

Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.


Operation

VCVTUDQ2PS (EVEX encoded version) when src operand is a register

(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← Convert_UInteger_To_Single_Precision_Floating_Point(SRC[i+31:i])
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0



VCVTUDQ2PS (EVEX encoded version) when src operand is a memory source

(KL, VL) = (4, 128), (8, 256), (16, 512)


FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1)
                THEN DEST[i+31:i] ← Convert_UInteger_To_Single_Precision_Floating_Point(SRC[31:0])
                ELSE DEST[i+31:i] ← Convert_UInteger_To_Single_Precision_Floating_Point(SRC[i+31:i])
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


Intel C/C++ Compiler Intrinsic Equivalent

VCVTUDQ2PS __m512 _mm512_cvtepu32_ps( __m512i a);
VCVTUDQ2PS __m512 _mm512_mask_cvtepu32_ps( __m512 s, __mmask16 k, __m512i a);
VCVTUDQ2PS __m512 _mm512_maskz_cvtepu32_ps( __mmask16 k, __m512i a);
VCVTUDQ2PS __m512 _mm512_cvt_roundepu32_ps( __m512i a, int r);
VCVTUDQ2PS __m512 _mm512_mask_cvt_roundepu32_ps( __m512 s, __mmask16 k, __m512i a, int r);
VCVTUDQ2PS __m512 _mm512_maskz_cvt_roundepu32_ps( __mmask16 k, __m512i a, int r);
VCVTUDQ2PS __m256 _mm256_cvtepu32_ps( __m256i a);
VCVTUDQ2PS __m256 _mm256_mask_cvtepu32_ps( __m256 s, __mmask8 k, __m256i a);
VCVTUDQ2PS __m256 _mm256_maskz_cvtepu32_ps( __mmask8 k, __m256i a);
VCVTUDQ2PS __m128 _mm_cvtepu32_ps( __m128i a);
VCVTUDQ2PS __m128 _mm_mask_cvtepu32_ps( __m128 s, __mmask8 k, __m128i a);
VCVTUDQ2PS __m128 _mm_maskz_cvtepu32_ps( __mmask8 k, __m128i a);


SIMD Floating-Point Exceptions

Precision


Other Exceptions

EVEX-encoded instructions, see Exceptions Type E2.

#UD If EVEX.vvvv != 1111B.


VCVTUQQ2PD—Convert Packed Unsigned Quadword Integers to Packed Double-Precision Floating-Point Values

Opcode/ Instruction

Op / En

64/32

bit Mode Support

CPUID

Feature Flag

Description

EVEX.128.F3.0F.W1 7A /r VCVTUQQ2PD xmm1 {k1}{z},

xmm2/m128/m64bcst

A

V/V

AVX512VL AVX512DQ

Convert two packed unsigned quadword integers from xmm2/m128/m64bcst to two packed double-precision floating-point values in xmm1 with writemask k1.

EVEX.256.F3.0F.W1 7A /r VCVTUQQ2PD ymm1 {k1}{z},

ymm2/m256/m64bcst

A

V/V

AVX512VL AVX512DQ

Convert four packed unsigned quadword integers from ymm2/m256/m64bcst to packed double-precision floating-point values in ymm1 with writemask k1.

EVEX.512.F3.0F.W1 7A /r VCVTUQQ2PD zmm1 {k1}{z},

zmm2/m512/m64bcst{er}

A

V/V

AVX512DQ

Convert eight packed unsigned quadword integers from zmm2/m512/m64bcst to eight packed double-precision floating-point values in zmm1 with writemask k1.


Instruction Operand Encoding

Op/En

Tuple Type

Operand 1

Operand 2

Operand 3

Operand 4

A

Full

ModRM:reg (w)

ModRM:r/m (r)

NA

NA

Description

Converts packed unsigned quadword integers in the source operand (second operand) to packed double-precision floating-point values in the destination operand (first operand).

The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.

Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.


Operation

VCVTUQQ2PD (EVEX encoded version) when src operand is a register

(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← Convert_UQuadInteger_To_Double_Precision_Floating_Point(SRC[i+63:i])
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0



VCVTUQQ2PD (EVEX encoded version) when src operand is a memory source

(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b == 1)
            THEN DEST[i+63:i] ← Convert_UQuadInteger_To_Double_Precision_Floating_Point(SRC[63:0])
            ELSE DEST[i+63:i] ← Convert_UQuadInteger_To_Double_Precision_Floating_Point(SRC[i+63:i])
        FI;
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+63:i] remains unchanged*
            ELSE ; zeroing-masking
                DEST[i+63:i] ← 0
        FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


Intel C/C++ Compiler Intrinsic Equivalent

VCVTUQQ2PD __m512d _mm512_cvtepu64_pd( __m512i a);
VCVTUQQ2PD __m512d _mm512_mask_cvtepu64_pd( __m512d s, __mmask8 k, __m512i a);
VCVTUQQ2PD __m512d _mm512_maskz_cvtepu64_pd( __mmask8 k, __m512i a);
VCVTUQQ2PD __m512d _mm512_cvt_roundepu64_pd( __m512i a, int r);
VCVTUQQ2PD __m512d _mm512_mask_cvt_roundepu64_pd( __m512d s, __mmask8 k, __m512i a, int r);
VCVTUQQ2PD __m512d _mm512_maskz_cvt_roundepu64_pd( __mmask8 k, __m512i a, int r);
VCVTUQQ2PD __m256d _mm256_cvtepu64_pd( __m256i a);
VCVTUQQ2PD __m256d _mm256_mask_cvtepu64_pd( __m256d s, __mmask8 k, __m256i a);
VCVTUQQ2PD __m256d _mm256_maskz_cvtepu64_pd( __mmask8 k, __m256i a);
VCVTUQQ2PD __m128d _mm_cvtepu64_pd( __m128i a);
VCVTUQQ2PD __m128d _mm_mask_cvtepu64_pd( __m128d s, __mmask8 k, __m128i a);
VCVTUQQ2PD __m128d _mm_maskz_cvtepu64_pd( __mmask8 k, __m128i a);


SIMD Floating-Point Exceptions

Precision


Other Exceptions

EVEX-encoded instructions, see Exceptions Type E2.

#UD If EVEX.vvvv != 1111B.


VCVTUQQ2PS—Convert Packed Unsigned Quadword Integers to Packed Single-Precision Floating-Point Values

Opcode/Instruction | Op/En | 64/32-bit Mode Support | CPUID Feature Flag | Description
EVEX.128.F2.0F.W1 7A /r VCVTUQQ2PS xmm1 {k1}{z}, xmm2/m128/m64bcst | A | V/V | AVX512VL AVX512DQ | Convert two packed unsigned quadword integers from xmm2/m128/m64bcst to packed single-precision floating-point values in xmm1 with writemask k1.
EVEX.256.F2.0F.W1 7A /r VCVTUQQ2PS xmm1 {k1}{z}, ymm2/m256/m64bcst | A | V/V | AVX512VL AVX512DQ | Convert four packed unsigned quadword integers from ymm2/m256/m64bcst to packed single-precision floating-point values in xmm1 with writemask k1.
EVEX.512.F2.0F.W1 7A /r VCVTUQQ2PS ymm1 {k1}{z}, zmm2/m512/m64bcst{er} | A | V/V | AVX512DQ | Convert eight packed unsigned quadword integers from zmm2/m512/m64bcst to eight packed single-precision floating-point values in ymm1 with writemask k1.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | Full | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Description

Converts packed unsigned quadword integers in the source operand (second operand) to single-precision floating- point values in the destination operand (first operand).

EVEX encoded versions: The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a 64-bit memory location. The destination operand is a YMM/XMM/XMM (low 64 bits) register conditionally updated with writemask k1.

Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
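Because this conversion narrows 64 significant bits into a 24-bit single-precision significand, it is inexact for most large inputs. A hedged one-element sketch (the helper name is illustrative): a plain C cast rounds the same way under the default round-to-nearest-even mode.

```c
#include <stdint.h>

/* One element of VCVTUQQ2PS sketched as a C cast: uint64 -> float.
 * Values needing more than 24 significant bits are rounded. */
static float cvtuqq2ps_elem(uint64_t x)
{
    return (float)x;
}
```

For example, 2^24 converts exactly, while 2^24 + 1 is a tie between two representable floats and rounds to the even significand.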


Operation

VCVTUQQ2PS (EVEX encoded version) when src operand is a register

(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;

FOR j ← 0 TO KL-1
    i ← j * 32
    k ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← Convert_UQuadInteger_To_Single_Precision_Floating_Point(SRC[k+63:k])
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL/2] ← 0



VCVTUQQ2PS (EVEX encoded version) when src operand is a memory source

(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j ← 0 TO KL-1
    i ← j * 32
    k ← j * 64
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b = 1)
            THEN DEST[i+31:i] ← Convert_UQuadInteger_To_Single_Precision_Floating_Point(SRC[63:0])
            ELSE DEST[i+31:i] ← Convert_UQuadInteger_To_Single_Precision_Floating_Point(SRC[k+63:k])
        FI;
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+31:i] remains unchanged*
            ELSE ; zeroing-masking
                DEST[i+31:i] ← 0
        FI
    FI;
ENDFOR
DEST[MAXVL-1:VL/2] ← 0


Intel C/C++ Compiler Intrinsic Equivalent

VCVTUQQ2PS __m256 _mm512_cvtepu64_ps( __m512i a);
VCVTUQQ2PS __m256 _mm512_mask_cvtepu64_ps( __m256 s, __mmask8 k, __m512i a);
VCVTUQQ2PS __m256 _mm512_maskz_cvtepu64_ps( __mmask8 k, __m512i a);
VCVTUQQ2PS __m256 _mm512_cvt_roundepu64_ps( __m512i a, int r);
VCVTUQQ2PS __m256 _mm512_mask_cvt_roundepu64_ps( __m256 s, __mmask8 k, __m512i a, int r);
VCVTUQQ2PS __m256 _mm512_maskz_cvt_roundepu64_ps( __mmask8 k, __m512i a, int r);
VCVTUQQ2PS __m128 _mm256_cvtepu64_ps( __m256i a);
VCVTUQQ2PS __m128 _mm256_mask_cvtepu64_ps( __m128 s, __mmask8 k, __m256i a);
VCVTUQQ2PS __m128 _mm256_maskz_cvtepu64_ps( __mmask8 k, __m256i a);
VCVTUQQ2PS __m128 _mm_cvtepu64_ps( __m128i a);
VCVTUQQ2PS __m128 _mm_mask_cvtepu64_ps( __m128 s, __mmask8 k, __m128i a);
VCVTUQQ2PS __m128 _mm_maskz_cvtepu64_ps( __mmask8 k, __m128i a);


SIMD Floating-Point Exceptions


Precision


Other Exceptions

EVEX-encoded instructions, see Exceptions Type E2.

#UD If EVEX.vvvv != 1111B.


VCVTUSI2SD—Convert Unsigned Integer to Scalar Double-Precision Floating-Point Value

Opcode/Instruction | Op/En | 64/32-bit Mode Support | CPUID Feature Flag | Description
EVEX.NDS.LIG.F2.0F.W0 7B /r VCVTUSI2SD xmm1, xmm2, r/m32 | A | V/V | AVX512F | Convert one unsigned doubleword integer from r/m32 to one double-precision floating-point value in xmm1.
EVEX.NDS.LIG.F2.0F.W1 7B /r VCVTUSI2SD xmm1, xmm2, r/m64{er} | A | V/N.E.1 | AVX512F | Convert one unsigned quadword integer from r/m64 to one double-precision floating-point value in xmm1.

NOTES:

1. For this specific instruction, EVEX.W in non-64-bit mode is ignored; the instruction behaves as if the W0 version is used.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | Tuple1 Scalar | ModRM:reg (w) | EVEX.vvvv | ModRM:r/m (r) | NA

Description

Converts an unsigned doubleword integer (or unsigned quadword integer if operand size is 64 bits) in the second source operand to a double-precision floating-point value in the destination operand. The result is stored in the low quadword of the destination operand. When conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register.

The second source operand can be a general-purpose register or a 32/64-bit memory location. The first source and destination operands are XMM registers. Bits (127:64) of the XMM register destination are copied from corre- sponding bits in the first source operand. Bits (MAXVL-1:128) of the destination register are zeroed.

EVEX.W1 version: promotes the instruction to use 64-bit input value in 64-bit mode.

EVEX.W0 version: an attempt to encode this instruction with EVEX embedded rounding is ignored.
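The two paragraphs above can be illustrated with a minimal C sketch (function names hypothetical): every 32-bit unsigned integer fits a double's 53-bit significand exactly, which is why the W0 form can ignore embedded rounding, while a 64-bit input may round.

```c
#include <stdint.h>

/* uint32 -> double is always exact (at most 32 significant bits
 * into a 53-bit significand), so rounding control is irrelevant. */
static double cvtusi2sd32(uint32_t x) { return (double)x; }

/* uint64 -> double may be inexact (64 bits into a 53-bit
 * significand), so the W1 form honors {er}/MXCSR rounding. */
static double cvtusi2sd64(uint64_t x) { return (double)x; }
```

For instance, 2^32 - 1 converts exactly, while 2^64 - 1 rounds (to 2^64 under round-to-nearest).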


Operation

VCVTUSI2SD (EVEX encoded version)

IF (SRC2 *is register*) AND (EVEX.b = 1)
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
IF 64-Bit Mode And OperandSize = 64
    THEN DEST[63:0] ← Convert_UInteger_To_Double_Precision_Floating_Point(SRC2[63:0]);
    ELSE DEST[63:0] ← Convert_UInteger_To_Double_Precision_Floating_Point(SRC2[31:0]);
FI;
DEST[127:64] ← SRC1[127:64]
DEST[MAXVL-1:128] ← 0



Intel C/C++ Compiler Intrinsic Equivalent

VCVTUSI2SD __m128d _mm_cvtu32_sd( __m128d s, unsigned a);
VCVTUSI2SD __m128d _mm_cvtu64_sd( __m128d s, unsigned __int64 a);
VCVTUSI2SD __m128d _mm_cvt_roundu64_sd( __m128d s, unsigned __int64 a, int r);


SIMD Floating-Point Exceptions

Precision


Other Exceptions

See Exceptions Type E3NF if W1, else type E10NF.


VCVTUSI2SS—Convert Unsigned Integer to Scalar Single-Precision Floating-Point Value

Opcode/Instruction | Op/En | 64/32-bit Mode Support | CPUID Feature Flag | Description
EVEX.NDS.LIG.F3.0F.W0 7B /r VCVTUSI2SS xmm1, xmm2, r/m32{er} | A | V/V | AVX512F | Convert one unsigned doubleword integer from r/m32 to one single-precision floating-point value in xmm1.
EVEX.NDS.LIG.F3.0F.W1 7B /r VCVTUSI2SS xmm1, xmm2, r/m64{er} | A | V/N.E.1 | AVX512F | Convert one unsigned quadword integer from r/m64 to one single-precision floating-point value in xmm1.

NOTES:

1. For this specific instruction, EVEX.W in non-64-bit mode is ignored; the instruction behaves as if the W0 version is used.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | Tuple1 Scalar | ModRM:reg (w) | EVEX.vvvv | ModRM:r/m (r) | NA

Description

Converts an unsigned doubleword integer (or unsigned quadword integer if operand size is 64 bits) in the source operand (second operand) to a single-precision floating-point value in the destination operand (first operand). The result is stored in the low doubleword of the destination operand. When the conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register or the embedded rounding control bits.

The second source operand can be a general-purpose register or a 32/64-bit memory location. The first source and destination operands are XMM registers. Bits (127:32) of the XMM register destination are copied from corre- sponding bits in the first source operand. Bits (MAXVL-1:128) of the destination register are zeroed.

EVEX.W1 version: promotes the instruction to use 64-bit input value in 64-bit mode.


Operation

VCVTUSI2SS (EVEX encoded version)

IF (SRC2 *is register*) AND (EVEX.b = 1)
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
IF 64-Bit Mode And OperandSize = 64
    THEN DEST[31:0] ← Convert_UInteger_To_Single_Precision_Floating_Point(SRC2[63:0]);
    ELSE DEST[31:0] ← Convert_UInteger_To_Single_Precision_Floating_Point(SRC2[31:0]);
FI;
DEST[127:32] ← SRC1[127:32]
DEST[MAXVL-1:128] ← 0


Intel C/C++ Compiler Intrinsic Equivalent

VCVTUSI2SS __m128 _mm_cvtu32_ss( __m128 s, unsigned a);
VCVTUSI2SS __m128 _mm_cvt_roundu32_ss( __m128 s, unsigned a, int r);
VCVTUSI2SS __m128 _mm_cvtu64_ss( __m128 s, unsigned __int64 a);
VCVTUSI2SS __m128 _mm_cvt_roundu64_ss( __m128 s, unsigned __int64 a, int r);



SIMD Floating-Point Exceptions

Precision


Other Exceptions

See Exceptions Type E3NF.


VDBPSADBW—Double Block Packed Sum-Absolute-Differences (SAD) on Unsigned Bytes

Opcode/Instruction | Op/En | 64/32-bit Mode Support | CPUID Feature Flag | Description
EVEX.NDS.128.66.0F3A.W0 42 /r ib VDBPSADBW xmm1 {k1}{z}, xmm2, xmm3/m128, imm8 | A | V/V | AVX512VL AVX512BW | Compute packed SAD word results of unsigned bytes in dword block from xmm2 with unsigned bytes of dword blocks transformed from xmm3/m128 using the shuffle controls in imm8. Results are written to xmm1 under the writemask k1.
EVEX.NDS.256.66.0F3A.W0 42 /r ib VDBPSADBW ymm1 {k1}{z}, ymm2, ymm3/m256, imm8 | A | V/V | AVX512VL AVX512BW | Compute packed SAD word results of unsigned bytes in dword block from ymm2 with unsigned bytes of dword blocks transformed from ymm3/m256 using the shuffle controls in imm8. Results are written to ymm1 under the writemask k1.
EVEX.NDS.512.66.0F3A.W0 42 /r ib VDBPSADBW zmm1 {k1}{z}, zmm2, zmm3/m512, imm8 | A | V/V | AVX512BW | Compute packed SAD word results of unsigned bytes in dword block from zmm2 with unsigned bytes of dword blocks transformed from zmm3/m512 using the shuffle controls in imm8. Results are written to zmm1 under the writemask k1.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | Full Mem | ModRM:reg (w) | EVEX.vvvv | ModRM:r/m (r) | Imm8

Description

Compute packed SAD (sum of absolute differences) word results of unsigned bytes from two 32-bit dword elements. Packed SAD word results are calculated in multiples of qword superblocks, producing 4 SAD word results in each 64-bit superblock of the destination register.

Within each super block of packed word results, the SAD results from two 32-bit dword elements are calculated as follows:

VGATHERQPS/VGATHERQPD—Gather Packed Single, Packed Double with Signed Qword Indices
Note that the presence of VSIB byte is enforced in this instruction. Hence, the instruction will #UD fault if ModRM.rm is different than 100b.

This instruction has special disp8*N and alignment rules. N is considered to be the size of a single vector element.

The scaled index may require more bits to represent than the address bits used by the processor (e.g., in 32-bit mode, if the scale is greater than one). In this case, the most significant bits beyond the number of address bits are ignored.

The instruction will #UD fault if the destination vector zmm1 is the same as index vector VINDEX. The instruction will #UD fault if the k0 mask register is specified.


Operation

BASE_ADDR stands for the memory operand base address (a GPR); may not exist
VINDEX stands for the memory operand vector of indices (a ZMM register)
SCALE stands for the memory operand scalar (1, 2, 4 or 8)
DISP is the optional 1, 2 or 4 byte displacement


VGATHERQPS (EVEX encoded version)

(KL, VL) = (2, 64), (4, 128), (8, 256)
FOR j ← 0 TO KL-1
    i ← j * 32
    k ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← MEM[BASE_ADDR + (VINDEX[k+63:k]) * SCALE + DISP]
            k1[j] ← 0
        ELSE *DEST[i+31:i] remains unchanged*
    FI;
ENDFOR
k1[MAX_KL-1:KL] ← 0
DEST[MAXVL-1:VL/2] ← 0
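The VGATHERQPS loop above can be sketched as a scalar reference model (not Intel's code). For readability this hypothetical helper indexes an array of floats rather than scaling a byte address, and it clears each mask bit as the element completes, mirroring the restartable-fault behavior:

```c
#include <stdint.h>

/* Emulate qword-index gather of dword elements: each unmasked
 * element loads base[vindex[j]]; its k1 bit is cleared once the
 * element completes, so a faulting gather can be re-executed. */
static void vgatherqps_ref(float *dst, const float *base,
                           const int64_t *vindex,
                           uint8_t *k1, int kl)
{
    for (int j = 0; j < kl; j++) {
        if (*k1 & (1u << j)) {
            dst[j] = base[vindex[j]];
            *k1 = (uint8_t)(*k1 & ~(1u << j)); /* element done */
        }
        /* else: dst[j] remains unchanged */
    }
}
```

After a full pass the mask is all zeroes, which is exactly the architectural end state of the instruction.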


VGATHERQPD (EVEX encoded version)

(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← MEM[BASE_ADDR + (VINDEX[i+63:i]) * SCALE + DISP]
            k1[j] ← 0
        ELSE *DEST[i+63:i] remains unchanged*
    FI;
ENDFOR
k1[MAX_KL-1:KL] ← 0
DEST[MAXVL-1:VL] ← 0



Intel C/C++ Compiler Intrinsic Equivalent

VGATHERQPD __m512d _mm512_i64gather_pd( __m512i vdx, void * base, int scale);
VGATHERQPD __m512d _mm512_mask_i64gather_pd( __m512d s, __mmask8 k, __m512i vdx, void * base, int scale);
VGATHERQPD __m256d _mm256_mask_i64gather_pd( __m256d s, __mmask8 k, __m256i vdx, void * base, int scale);
VGATHERQPD __m128d _mm_mask_i64gather_pd( __m128d s, __mmask8 k, __m128i vdx, void * base, int scale);
VGATHERQPS __m256 _mm512_i64gather_ps( __m512i vdx, void * base, int scale);
VGATHERQPS __m256 _mm512_mask_i64gather_ps( __m256 s, __mmask8 k, __m512i vdx, void * base, int scale);
VGATHERQPS __m128 _mm256_mask_i64gather_ps( __m128 s, __mmask8 k, __m256i vdx, void * base, int scale);
VGATHERQPS __m128 _mm_mask_i64gather_ps( __m128 s, __mmask8 k, __m128i vdx, void * base, int scale);


SIMD Floating-Point Exceptions

None


Other Exceptions

See Exceptions Type E12.


VGETEXPPD—Convert Exponents of Packed DP FP Values to DP FP Values

Opcode/Instruction | Op/En | 64/32-bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F38.W1 42 /r VGETEXPPD xmm1 {k1}{z}, xmm2/m128/m64bcst | A | V/V | AVX512VL AVX512F | Convert the exponents of packed double-precision floating-point values in the source operand to DP FP results representing unbiased integer exponents and store the results in the destination register.
EVEX.256.66.0F38.W1 42 /r VGETEXPPD ymm1 {k1}{z}, ymm2/m256/m64bcst | A | V/V | AVX512VL AVX512F | Convert the exponents of packed double-precision floating-point values in the source operand to DP FP results representing unbiased integer exponents and store the results in the destination register.
EVEX.512.66.0F38.W1 42 /r VGETEXPPD zmm1 {k1}{z}, zmm2/m512/m64bcst{sae} | A | V/V | AVX512F | Convert the exponents of packed double-precision floating-point values in the source operand to DP FP results representing unbiased integer exponents and store the results in the destination under writemask k1.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | Full | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Description

Extracts the biased exponent from the normalized DP FP representation of each qword data element of the source operand (the second operand) as an unbiased signed integer value, or converts the denormal representation of input data to unbiased negative integer values. Each unbiased exponent value is converted to a double-precision FP value and written to the corresponding qword element of the destination operand (the first operand) as a DP FP number.

The destination operand is a ZMM/YMM/XMM register and updated under the writemask. The source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a 64-bit memory location.

EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Each GETEXP operation converts the exponent value into a FP number (permitting input value in denormal repre- sentation). Special cases of input values are listed in Table 5-5.

The formula is:

GETEXP(x) = floor(log2(|x|))

Notation floor(x) stands for the greatest integer not exceeding real number x.
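For normal, finite, nonzero doubles this formula reduces to extracting and unbiasing the exponent field, the same shift-right-by-52 and subtract-1023 steps the Operation section performs. A hedged C sketch under that assumption (the function name is illustrative; NaN, infinity, zero, and denormal handling are omitted):

```c
#include <stdint.h>
#include <string.h>

/* GETEXP(x) = floor(log2(|x|)) for normal, nonzero, finite doubles:
 * shift out the 52 fraction bits, mask off the sign bit, subtract
 * the exponent bias (1023). Special cases are not handled here. */
static double getexp_pd_normal(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);              /* safe type-pun */
    int e = (int)((bits >> 52) & 0x7FF) - 1023;  /* unbias exponent */
    return (double)e;
}
```

For example, the result for 2.0 is 1.0, for 0.5 it is -1.0, and the sign of the input is ignored.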


Table 5-5. VGETEXPPD/SD Special Cases

Input Operand | Result | Comments
src1 = NaN | QNaN(src1) | If (SRC = SNaN) then #IE
0 < |src1| < INF | floor(log2(|src1|)) | If (SRC = denormal) then #DE
|src1| = +INF | +INF |
|src1| = 0 | -INF |


Operation

NormalizeExpTinyDPFP(SRC[63:0])
{
    // Jbit is the hidden integral bit of a FP number. In case of denormal number it has the value of ZERO.
    Src.Jbit ← 0;
    Dst.exp ← 1;
    Dst.fraction ← SRC[51:0];
    WHILE (Src.Jbit = 0)
    {
        Src.Jbit ← Dst.fraction[51]; // Get the fraction MSB
        Dst.fraction ← Dst.fraction << 1; // One bit shift left
        Dst.exp--; // Decrement the exponent
    }
    Dst.fraction ← 0; // zero out fraction bits
    Dst.sign ← 1; // Return negative sign
    TMP[63:0] ← MXCSR.DAZ ? 0 : (Dst.sign << 63) OR (Dst.exp << 52) OR (Dst.fraction);
    Return (TMP[63:0]);
}


ConvertExpDPFP(SRC[63:0])
{
    Src.sign ← 0; // Zero out sign bit
    Src.exp ← SRC[62:52];
    Src.fraction ← SRC[51:0];
    // Check for NaN
    IF (SRC = NaN)
    {
        IF (SRC = SNAN) SET IE;
        Return QNAN(SRC);
    }
    // Check for +INF
    IF (SRC = +INF) Return (SRC);
    // Check if zero operand
    IF ((Src.exp = 0) AND ((Src.fraction = 0) OR (MXCSR.DAZ = 1)))
        Return (-INF);
    ELSE // check if denormal operand (notice that MXCSR.DAZ = 0)
    {
        IF ((Src.exp = 0) AND (Src.fraction != 0))
        {
            TMP[63:0] ← NormalizeExpTinyDPFP(SRC[63:0]); // Get Normalized Exponent
            Set #DE
        }
        ELSE // exponent value is correct
        {
            TMP[63:0] ← (Src.sign << 63) OR (Src.exp << 52) OR (Src.fraction);
        }
        TMP ← SAR(TMP, 52); // Shift Arithmetic Right
        TMP ← TMP - 1023; // Subtract Bias
        Return CvtI2D(TMP); // Convert INT to Double-Precision FP number
    }
}



VGETEXPPD (EVEX encoded versions)

(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b = 1) AND (SRC *is memory*)
            THEN DEST[i+63:i] ← ConvertExpDPFP(SRC[63:0])
            ELSE DEST[i+63:i] ← ConvertExpDPFP(SRC[i+63:i])
        FI;
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+63:i] remains unchanged*
            ELSE ; zeroing-masking
                DEST[i+63:i] ← 0
        FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


Intel C/C++ Compiler Intrinsic Equivalent

VGETEXPPD __m512d _mm512_getexp_pd( __m512d a);
VGETEXPPD __m512d _mm512_mask_getexp_pd( __m512d s, __mmask8 k, __m512d a);
VGETEXPPD __m512d _mm512_maskz_getexp_pd( __mmask8 k, __m512d a);
VGETEXPPD __m512d _mm512_getexp_round_pd( __m512d a, int sae);
VGETEXPPD __m512d _mm512_mask_getexp_round_pd( __m512d s, __mmask8 k, __m512d a, int sae);
VGETEXPPD __m512d _mm512_maskz_getexp_round_pd( __mmask8 k, __m512d a, int sae);
VGETEXPPD __m256d _mm256_getexp_pd( __m256d a);
VGETEXPPD __m256d _mm256_mask_getexp_pd( __m256d s, __mmask8 k, __m256d a);
VGETEXPPD __m256d _mm256_maskz_getexp_pd( __mmask8 k, __m256d a);
VGETEXPPD __m128d _mm_getexp_pd( __m128d a);
VGETEXPPD __m128d _mm_mask_getexp_pd( __m128d s, __mmask8 k, __m128d a);
VGETEXPPD __m128d _mm_maskz_getexp_pd( __mmask8 k, __m128d a);


SIMD Floating-Point Exceptions

Invalid, Denormal


Other Exceptions

See Exceptions Type E2.

#UD If EVEX.vvvv != 1111B.


VGETEXPPS—Convert Exponents of Packed SP FP Values to SP FP Values

Opcode/Instruction | Op/En | 64/32-bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F38.W0 42 /r VGETEXPPS xmm1 {k1}{z}, xmm2/m128/m32bcst | A | V/V | AVX512VL AVX512F | Convert the exponents of packed single-precision floating-point values in the source operand to SP FP results representing unbiased integer exponents and store the results in the destination register.
EVEX.256.66.0F38.W0 42 /r VGETEXPPS ymm1 {k1}{z}, ymm2/m256/m32bcst | A | V/V | AVX512VL AVX512F | Convert the exponents of packed single-precision floating-point values in the source operand to SP FP results representing unbiased integer exponents and store the results in the destination register.
EVEX.512.66.0F38.W0 42 /r VGETEXPPS zmm1 {k1}{z}, zmm2/m512/m32bcst{sae} | A | V/V | AVX512F | Convert the exponents of packed single-precision floating-point values in the source operand to SP FP results representing unbiased integer exponents and store the results in the destination register.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | Full | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Description

Extracts the biased exponent from the normalized SP FP representation of each dword element of the source operand (the second operand) as an unbiased signed integer value, or converts the denormal representation of input data to unbiased negative integer values. Each unbiased exponent value is converted to a single-precision FP value and written to the corresponding dword element of the destination operand (the first operand) as an SP FP number.

The destination operand is a ZMM/YMM/XMM register and updated under the writemask. The source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a 32-bit memory location.

EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Each GETEXP operation converts the exponent value into a FP number (permitting input value in denormal repre- sentation). Special cases of input values are listed in Table 5-6.

The formula is:

GETEXP(x) = floor(log2(|x|))

Notation floor(x) stands for the greatest integer not exceeding real number x.

Software usage of the VGETEXPxx and VGETMANTxx instructions generally involves a combination of the GETEXP and GETMANT operations (see VGETMANTPD). Thus the VGETEXPxx instructions do not require software to handle SIMD FP exceptions.


Table 5-6. VGETEXPPS/SS Special Cases

Input Operand | Result | Comments
src1 = NaN | QNaN(src1) | If (SRC = SNaN) then #IE
0 < |src1| < INF | floor(log2(|src1|)) | If (SRC = denormal) then #DE
|src1| = +INF | +INF |
|src1| = 0 | -INF |



Figure 5-14 illustrates the VGETEXPPS functionality on input values with normalized representation.

[Figure 5-14. VGETEXPPS Functionality On Normal Input values: worked example for Src = 2^1 (s = 0, exp = 10000000b, fraction = 0): SAR Src, 23 = 080h; Tmp - Bias = 1; Cvt_PI2PS(01h) = 2^0]



Operation

NormalizeExpTinySPFP(SRC[31:0])
{
    // Jbit is the hidden integral bit of a FP number. In case of denormal number it has the value of ZERO.
    Src.Jbit ← 0;
    Dst.exp ← 1;
    Dst.fraction ← SRC[22:0];
    WHILE (Src.Jbit = 0)
    {
        Src.Jbit ← Dst.fraction[22]; // Get the fraction MSB
        Dst.fraction ← Dst.fraction << 1; // One bit shift left
        Dst.exp--; // Decrement the exponent
    }
    Dst.fraction ← 0; // zero out fraction bits
    Dst.sign ← 1; // Return negative sign
    TMP[31:0] ← MXCSR.DAZ ? 0 : (Dst.sign << 31) OR (Dst.exp << 23) OR (Dst.fraction);
    Return (TMP[31:0]);
}

ConvertExpSPFP(SRC[31:0])
{
    Src.sign ← 0; // Zero out sign bit
    Src.exp ← SRC[30:23];
    Src.fraction ← SRC[22:0];
    // Check for NaN
    IF (SRC = NaN)
    {
        IF (SRC = SNAN) SET IE;
        Return QNAN(SRC);
    }
    // Check for +INF
    IF (SRC = +INF) Return (SRC);
    // Check if zero operand
    IF ((Src.exp = 0) AND ((Src.fraction = 0) OR (MXCSR.DAZ = 1)))
        Return (-INF);
    ELSE // check if denormal operand (notice that MXCSR.DAZ = 0)
    {
        IF ((Src.exp = 0) AND (Src.fraction != 0))
        {
            TMP[31:0] ← NormalizeExpTinySPFP(SRC[31:0]); // Get Normalized Exponent
            Set #DE
        }
        ELSE // exponent value is correct
        {
            TMP[31:0] ← (Src.sign << 31) OR (Src.exp << 23) OR (Src.fraction);
        }
        TMP ← SAR(TMP, 23); // Shift Arithmetic Right
        TMP ← TMP - 127; // Subtract Bias
        Return CvtI2S(TMP); // Convert INT to Single-Precision FP number
    }
}


VGETEXPPS (EVEX encoded versions)

(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b = 1) AND (SRC *is memory*)
            THEN DEST[i+31:i] ← ConvertExpSPFP(SRC[31:0])
            ELSE DEST[i+31:i] ← ConvertExpSPFP(SRC[i+31:i])
        FI;
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+31:i] remains unchanged*
            ELSE ; zeroing-masking
                DEST[i+31:i] ← 0
        FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0



Intel C/C++ Compiler Intrinsic Equivalent

VGETEXPPS __m512 _mm512_getexp_ps( __m512 a);
VGETEXPPS __m512 _mm512_mask_getexp_ps( __m512 s, __mmask16 k, __m512 a);
VGETEXPPS __m512 _mm512_maskz_getexp_ps( __mmask16 k, __m512 a);
VGETEXPPS __m512 _mm512_getexp_round_ps( __m512 a, int sae);
VGETEXPPS __m512 _mm512_mask_getexp_round_ps( __m512 s, __mmask16 k, __m512 a, int sae);
VGETEXPPS __m512 _mm512_maskz_getexp_round_ps( __mmask16 k, __m512 a, int sae);
VGETEXPPS __m256 _mm256_getexp_ps( __m256 a);
VGETEXPPS __m256 _mm256_mask_getexp_ps( __m256 s, __mmask8 k, __m256 a);
VGETEXPPS __m256 _mm256_maskz_getexp_ps( __mmask8 k, __m256 a);
VGETEXPPS __m128 _mm_getexp_ps( __m128 a);
VGETEXPPS __m128 _mm_mask_getexp_ps( __m128 s, __mmask8 k, __m128 a);
VGETEXPPS __m128 _mm_maskz_getexp_ps( __mmask8 k, __m128 a);


SIMD Floating-Point Exceptions

Invalid, Denormal


Other Exceptions

See Exceptions Type E2.

#UD If EVEX.vvvv != 1111B.


VGETEXPSD—Convert Exponents of Scalar DP FP Values to DP FP Value

Opcode/Instruction | Op/En | 64/32-bit Mode Support | CPUID Feature Flag | Description
EVEX.NDS.LIG.66.0F38.W1 43 /r VGETEXPSD xmm1 {k1}{z}, xmm2, xmm3/m64{sae} | A | V/V | AVX512F | Convert the biased exponent (bits 62:52) of the low double-precision floating-point value in xmm3/m64 to a DP FP value representing the unbiased integer exponent. Store the result in the low 64 bits of xmm1 under writemask k1 and merge with the other elements of xmm2.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | Tuple1 Scalar | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA

Description

Extracts the biased exponent from the normalized DP FP representation of the low qword data element of the source operand (the third operand) as an unbiased signed integer value, or converts the denormal representation of input data to an unbiased negative integer value. The integer value of the unbiased exponent is converted to a double-precision FP value and written to the destination operand (the first operand) as a DP FP number. Bits (127:64) of the XMM register destination are copied from the corresponding bits in the first source operand.

The destination must be an XMM register; the source operand can be an XMM register or a float64 memory location. The low quadword element of the destination operand is conditionally updated with writemask k1.

Each GETEXP operation converts the exponent value into a FP number (permitting input value in denormal repre- sentation). Special cases of input values are listed in Table 5-5.

The formula is:

GETEXP(x) = floor(log2(|x|))

Notation floor(x) stands for the greatest integer not exceeding real number x.


Operation

// NormalizeExpTinyDPFP(SRC[63:0]) is defined in the Operation section of VGETEXPPD


// ConvertExpDPFP(SRC[63:0]) is defined in the Operation section of VGETEXPPD


VGETEXPSD (EVEX encoded version)

IF k1[0] OR *no writemask*
    THEN DEST[63:0] ← ConvertExpDPFP(SRC2[63:0])
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[63:0] remains unchanged*
            ELSE ; zeroing-masking
                DEST[63:0] ← 0
        FI
FI;
DEST[127:64] ← SRC1[127:64]
DEST[MAXVL-1:128] ← 0



Intel C/C++ Compiler Intrinsic Equivalent

VGETEXPSD __m128d _mm_getexp_sd( __m128d a, __m128d b);
VGETEXPSD __m128d _mm_mask_getexp_sd( __m128d s, __mmask8 k, __m128d a, __m128d b);
VGETEXPSD __m128d _mm_maskz_getexp_sd( __mmask8 k, __m128d a, __m128d b);
VGETEXPSD __m128d _mm_getexp_round_sd( __m128d a, __m128d b, int sae);
VGETEXPSD __m128d _mm_mask_getexp_round_sd( __m128d s, __mmask8 k, __m128d a, __m128d b, int sae);
VGETEXPSD __m128d _mm_maskz_getexp_round_sd( __mmask8 k, __m128d a, __m128d b, int sae);


SIMD Floating-Point Exceptions

Invalid, Denormal


Other Exceptions

See Exceptions Type E3.


VGETEXPSS—Convert Exponents of Scalar SP FP Values to SP FP Value

Opcode/Instruction | Op/En | 64/32-bit Mode Support | CPUID Feature Flag | Description
EVEX.NDS.LIG.66.0F38.W0 43 /r VGETEXPSS xmm1 {k1}{z}, xmm2, xmm3/m32{sae} | A | V/V | AVX512F | Convert the biased exponent (bits 30:23) of the low single-precision floating-point value in xmm3/m32 to an SP FP value representing the unbiased integer exponent. Store the result in xmm1 under writemask k1 and merge with the other elements of xmm2.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | Tuple1 Scalar | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA

Description

Extracts the biased exponent from the normalized SP FP representation of the low doubleword data element of the source operand (the third operand) as an unbiased signed integer value, or converts the denormal representation of input data to an unbiased negative integer value. The integer value of the unbiased exponent is converted to a single-precision FP value and written to the destination operand (the first operand) as an SP FP number. Bits (127:32) of the XMM register destination are copied from the corresponding bits in the first source operand.

The destination must be an XMM register; the source operand can be an XMM register or a float32 memory location. The low doubleword element of the destination operand is conditionally updated with writemask k1.

Each GETEXP operation converts the exponent value into a FP number (permitting input value in denormal repre- sentation). Special cases of input values are listed in Table 5-6.

The formula is:

GETEXP(x) = floor(log2(|x|))

Notation floor(x) stands for the greatest integer not exceeding real number x.

Software usage of the VGETEXPxx and VGETMANTxx instructions generally involves a combination of the GETEXP and GETMANT operations (see VGETMANTPD). Thus the VGETEXPxx instructions do not require software to handle SIMD FP exceptions.


Operation

// NormalizeExpTinySPFP(SRC[31:0]) is defined in the Operation section of VGETEXPPS


// ConvertExpSPFP(SRC[31:0]) is defined in the Operation section of VGETEXPPS


VGETEXPSS (EVEX encoded version)

IF k1[0] OR *no writemask*
    THEN DEST[31:0] ← ConvertExpSPFP(SRC2[31:0])
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[31:0] remains unchanged*
            ELSE ; zeroing-masking
                DEST[31:0] ← 0
        FI
FI;
DEST[127:32] ← SRC1[127:32]
DEST[MAXVL-1:128] ← 0



Intel C/C++ Compiler Intrinsic Equivalent

VGETEXPSS __m128 _mm_getexp_ss( __m128 a, __m128 b);
VGETEXPSS __m128 _mm_mask_getexp_ss( __m128 s, __mmask8 k, __m128 a, __m128 b);
VGETEXPSS __m128 _mm_maskz_getexp_ss( __mmask8 k, __m128 a, __m128 b);
VGETEXPSS __m128 _mm_getexp_round_ss( __m128 a, __m128 b, int sae);
VGETEXPSS __m128 _mm_mask_getexp_round_ss( __m128 s, __mmask8 k, __m128 a, __m128 b, int sae);
VGETEXPSS __m128 _mm_maskz_getexp_round_ss( __mmask8 k, __m128 a, __m128 b, int sae);


SIMD Floating-Point Exceptions

Invalid, Denormal


Other Exceptions

See Exceptions Type E3.


VGETMANTPD—Extract Float64 Vector of Normalized Mantissas from Float64 Vector

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description

EVEX.128.66.0F3A.W1 26 /r ib
VGETMANTPD xmm1 {k1}{z}, xmm2/m128/m64bcst, imm8
A | V/V | AVX512VL AVX512F
Get normalized mantissa from float64 vector xmm2/m128/m64bcst and store the result in xmm1, using imm8 for sign control and mantissa interval normalization, under writemask.

EVEX.256.66.0F3A.W1 26 /r ib
VGETMANTPD ymm1 {k1}{z}, ymm2/m256/m64bcst, imm8
A | V/V | AVX512VL AVX512F
Get normalized mantissa from float64 vector ymm2/m256/m64bcst and store the result in ymm1, using imm8 for sign control and mantissa interval normalization, under writemask.

EVEX.512.66.0F3A.W1 26 /r ib
VGETMANTPD zmm1 {k1}{z}, zmm2/m512/m64bcst{sae}, imm8
A | V/V | AVX512F
Get normalized mantissa from float64 vector zmm2/m512/m64bcst and store the result in zmm1, using imm8 for sign control and mantissa interval normalization, under writemask.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | Full | ModRM:reg (w) | ModRM:r/m (r) | Imm8 | NA

Description

Convert double-precision floating-point values in the source operand (the second operand) to DP FP values with the mantissa normalization and sign control specified by the imm8 byte, see Figure 5-15. The converted results are written to the destination operand (the first operand) using writemask k1. The normalized mantissa is specified by interv (imm8[1:0]) and the sign control (sc) is specified by bits 3:2 of the immediate byte.


The destination operand is a ZMM/YMM/XMM register updated under the writemask. The source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a 64- bit memory location.


imm8[7:4]: Must Be Zero
imm8[3:2] (Sign Control, SC):
    00b : sign(SRC)
    01b : 0
    Imm8[3] = 1b : qNaN_Indefinite if sign(SRC) != 0, regardless of imm8[2]
imm8[1:0] (Normalization Interval):
    00b : Interval is [1, 2)
    01b : Interval is [1/2, 2)
    10b : Interval is [1/2, 1)
    11b : Interval is [3/4, 3/2)

Figure 5-15. Imm8 Controls for VGETMANTPD/SD/PS/SS



For each input DP FP value x, the conversion operation is:

GetMant(x) = ±2^k|x.significand|

where:


1 <= |x.significand| < 2


Unbiased exponent k depends on the interval range defined by interv and whether the exponent of the source is even or odd. The sign of the final result is determined by sc and the source sign.


If interv != 0 then k = -1, otherwise k = 0. The encoded value of imm8[1:0] and the sign control are shown in Figure 5-15.

Each converted DP FP result is encoded according to the sign control, the unbiased exponent k (adding bias) and a mantissa normalized to the range specified by interv.

The GetMant() function follows Table 5-7 when dealing with floating-point special numbers.
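The interval and sign-control rules above can be made concrete with a short Python sketch of GetMant() for finite, nonzero, normal inputs. This is an illustration under stated assumptions, not the hardware algorithm: the function name and argument encoding are hypothetical, and special values follow Table 5-7 rather than this code.

```python
import math

def getmant(x: float, interv: int, sc: int) -> float:
    """Sketch of GetMant for normal finite inputs.

    interv = imm8[1:0] selects the normalization interval; sc = imm8[3:2]
    is the sign control. Special values (zero, Inf, NaN, denormals with
    DAZ) are defined by Table 5-7 and omitted here.
    """
    if sc & 0b10 and math.copysign(1.0, x) < 0:
        return float("nan")          # QNaN_Indefinite; #IE is also signaled
    m, e = math.frexp(abs(x))        # x = m * 2**e with 0.5 <= m < 1
    s, e = m * 2.0, e - 1            # significand s in [1, 2), unbiased exponent e
    if interv == 0b00:
        r = s                        # interval [1, 2)
    elif interv == 0b01:
        r = s / 2 if e % 2 else s    # interval [1/2, 2), exponent parity decides
    elif interv == 0b10:
        r = s / 2                    # interval [1/2, 1)
    else:
        r = s / 2 if s >= 1.5 else s # interval [3/4, 3/2), fraction MSB decides
    sign = -1.0 if (sc & 1) == 0 and math.copysign(1.0, x) < 0 else 1.0
    return sign * r
```

For example, x = 12.0 has significand 1.5 and unbiased exponent 3, so interv = 00b returns 1.5 while interv = 01b, 10b, and 11b all return 0.75.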

This instruction is writemasked, so only those elements with the corresponding bit set in vector mask register k1 are computed and stored into the destination. Elements in zmm1 with the corresponding bit clear in k1 retain their previous values.

Note: EVEX.vvvv is reserved and must be 1111b; otherwise instructions will #UD.


Table 5-7. GetMant() Special Float Values Behavior

Input    | Result                                                                    | Exceptions / Comments
NaN      | QNaN(SRC)                                                                 | Ignore interv; if (SRC = SNaN) then #IE
+∞       | 1.0                                                                       | Ignore interv
+0       | 1.0                                                                       | Ignore interv
-0       | IF (SC[0]) THEN +1.0 ELSE -1.0                                            | Ignore interv
-∞       | IF (SC[1]) THEN {QNaN_Indefinite} ELSE {IF (SC[0]) THEN +1.0 ELSE -1.0}   | Ignore interv; if (SC[1]) then #IE
negative | SC[1] ? QNaN_Indefinite : GetMant(SRC)                                    | If (SC[1]) then #IE

Operation

GetNormalizeMantissaDP(SRC[63:0], SignCtrl[1:0], Interv[1:0])
{
    // Extract the SRC sign, exponent and mantissa fields
    Dst.sign ← SignCtrl[0] ? 0 : Src[63]; // Get sign bit
    Dst.exp ← SRC[62:52]; // Get original exponent value
    Dst.fraction ← SRC[51:0]; // Get original fraction value
    ZeroOperand ← (Dst.exp = 0) AND (Dst.fraction = 0);
    DenormOperand ← (Dst.exp = 0h) AND (Dst.fraction != 0);
    InfiniteOperand ← (Dst.exp = 07FFh) AND (Dst.fraction = 0);
    NaNOperand ← (Dst.exp = 07FFh) AND (Dst.fraction != 0);

    // Check for NaN operand
    IF (NaNOperand)
    {
        IF (SRC = SNaN) {Set #IE;}
        Return QNAN(SRC);
    }
    // Check for Zero and Infinite operands
    IF (ZeroOperand OR InfiniteOperand)
    {
        Dst.exp ← 03FFh; // Override exponent with BIAS
        Return ((Dst.sign << 63) | (Dst.exp << 52) | (Dst.fraction));
    }
    // Check for negative operand (including -0.0)
    IF ((Src[63] = 1) AND SignCtrl[1])
    {
        Set #IE;
        Return QNaN_Indefinite;
    }
    // Check for denormal operand
    IF (DenormOperand)
    {
        IF (MXCSR.DAZ = 1)
            Dst.fraction ← 0; // Zero out fraction
        ELSE
        {
            // Jbit is the hidden integral bit. Zero in case of denormal operand.
            Src.Jbit ← 0; // Zero Src Jbit
            Dst.exp ← 03FFh; // Override exponent with BIAS
            WHILE (Src.Jbit = 0) { // normalize mantissa
                Src.Jbit ← Dst.fraction[51]; // Get the fraction MSB
                Dst.fraction ← (Dst.fraction << 1); // Start normalizing the mantissa
                Dst.exp--; // Adjust the exponent
            }
            SET #DE; // Set DE bit
        }
    } // At this point, Dst.fraction is normalized.
    // Check the exponent response
    Unbiased.exp ← Dst.exp - 03FFh; // subtract the bias from exponent
    IsOddExp ← Unbiased.exp[0]; // recognize unbiased ODD exponent
    SignalingBit ← Dst.fraction[51];
    CASE (interv[1:0]) OF
        00: Dst.exp ← 03FFh; // This is the bias
        01: Dst.exp ← (IsOddExp) ? 03FEh : 03FFh; // either bias-1, or bias
        10: Dst.exp ← 03FEh; // bias-1
        11: Dst.exp ← (SignalingBit) ? 03FEh : 03FFh; // either bias-1, or bias
    ESAC
    // At this point Dst.exp has the correct result. Form the final destination:
    DEST[63:0] ← (Dst.sign << 63) OR (Dst.exp << 52) OR (Dst.fraction);
    Return (DEST);
}
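The bit-field manipulation in the pseudocode above (keep the fraction, override the biased exponent field with bias or bias-1) can be mirrored directly on the IEEE binary64 encoding. This Python sketch covers only the normal-input path; the zero/denormal/Inf/NaN paths are out of scope, and the function name is illustrative:

```python
import struct

def getmant_bits(x: float, interv: int, sc: int) -> float:
    """Sketch of the normal-input path of GetNormalizeMantissaDP: keep the
    52-bit fraction and override the biased exponent with 0x3FF (bias) or
    0x3FE (bias - 1), exactly as in the CASE block of the pseudocode."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    exp = (bits >> 52) & 0x7FF
    frac = bits & ((1 << 52) - 1)
    assert 0 < exp < 0x7FF, "zero/denormal/Inf/NaN take the special paths"
    if (sc >> 1) & 1 and (bits >> 63):        # negative input with SC[1] set
        return float("nan")                   # QNaN_Indefinite; #IE is signaled
    sign = 0 if sc & 1 else (bits >> 63)      # SC[0] forces the sign to 0
    is_odd = (exp - 0x3FF) & 1                # parity of the unbiased exponent
    msb = frac >> 51                          # SignalingBit: fraction MSB
    new_exp = [0x3FF,                         # interv=00: [1, 2)
               0x3FE if is_odd else 0x3FF,    # interv=01: [1/2, 2)
               0x3FE,                         # interv=10: [1/2, 1)
               0x3FE if msb else 0x3FF][interv & 3]  # interv=11: [3/4, 3/2)
    out = (sign << 63) | (new_exp << 52) | frac
    return struct.unpack("<d", struct.pack("<Q", out))[0]
```

Overriding the exponent field while leaving the fraction untouched is why GetMant never rounds: the significand bits pass through unchanged.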


VGETMANTPD (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
SignCtrl[1:0] ← IMM8[3:2];
Interv[1:0] ← IMM8[1:0];
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b = 1) AND (SRC *is memory*)
            THEN DEST[i+63:i] ← GetNormalizedMantissaDP(SRC[63:0], SignCtrl, Interv)
            ELSE DEST[i+63:i] ← GetNormalizedMantissaDP(SRC[i+63:i], SignCtrl, Interv)
        FI;
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+63:i] remains unchanged*
        ELSE ; zeroing-masking
            DEST[i+63:i] ← 0
        FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0



Intel C/C++ Compiler Intrinsic Equivalent

VGETMANTPD __m512d _mm512_getmant_pd( __m512d a, enum intv, enum sgn);
VGETMANTPD __m512d _mm512_mask_getmant_pd( __m512d s, __mmask8 k, __m512d a, enum intv, enum sgn);
VGETMANTPD __m512d _mm512_maskz_getmant_pd( __mmask8 k, __m512d a, enum intv, enum sgn);
VGETMANTPD __m512d _mm512_getmant_round_pd( __m512d a, enum intv, enum sgn, int r);
VGETMANTPD __m512d _mm512_mask_getmant_round_pd( __m512d s, __mmask8 k, __m512d a, enum intv, enum sgn, int r);
VGETMANTPD __m512d _mm512_maskz_getmant_round_pd( __mmask8 k, __m512d a, enum intv, enum sgn, int r);
VGETMANTPD __m256d _mm256_getmant_pd( __m256d a, enum intv, enum sgn);
VGETMANTPD __m256d _mm256_mask_getmant_pd( __m256d s, __mmask8 k, __m256d a, enum intv, enum sgn);
VGETMANTPD __m256d _mm256_maskz_getmant_pd( __mmask8 k, __m256d a, enum intv, enum sgn);
VGETMANTPD __m128d _mm_getmant_pd( __m128d a, enum intv, enum sgn);
VGETMANTPD __m128d _mm_mask_getmant_pd( __m128d s, __mmask8 k, __m128d a, enum intv, enum sgn);
VGETMANTPD __m128d _mm_maskz_getmant_pd( __mmask8 k, __m128d a, enum intv, enum sgn);


SIMD Floating-Point Exceptions

Denormal, Invalid


Other Exceptions

See Exceptions Type E2.

#UD If EVEX.vvvv != 1111B.


VGETMANTPS—Extract Float32 Vector of Normalized Mantissas from Float32 Vector

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description

EVEX.128.66.0F3A.W0 26 /r ib
VGETMANTPS xmm1 {k1}{z}, xmm2/m128/m32bcst, imm8
A | V/V | AVX512VL AVX512F
Get normalized mantissa from float32 vector xmm2/m128/m32bcst and store the result in xmm1, using imm8 for sign control and mantissa interval normalization, under writemask.

EVEX.256.66.0F3A.W0 26 /r ib
VGETMANTPS ymm1 {k1}{z}, ymm2/m256/m32bcst, imm8
A | V/V | AVX512VL AVX512F
Get normalized mantissa from float32 vector ymm2/m256/m32bcst and store the result in ymm1, using imm8 for sign control and mantissa interval normalization, under writemask.

EVEX.512.66.0F3A.W0 26 /r ib
VGETMANTPS zmm1 {k1}{z}, zmm2/m512/m32bcst{sae}, imm8
A | V/V | AVX512F
Get normalized mantissa from float32 vector zmm2/m512/m32bcst and store the result in zmm1, using imm8 for sign control and mantissa interval normalization, under writemask.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | Full | ModRM:reg (w) | ModRM:r/m (r) | Imm8 | NA

Description

Convert single-precision floating-point values in the source operand (the second operand) to SP FP values with the mantissa normalization and sign control specified by the imm8 byte, see Figure 5-15. The converted results are written to the destination operand (the first operand) using writemask k1. The normalized mantissa is specified by interv (imm8[1:0]) and the sign control (sc) is specified by bits 3:2 of the immediate byte.

The destination operand is a ZMM/YMM/XMM register updated under the writemask. The source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a 32- bit memory location.

For each input SP FP value x, the conversion operation is:

GetMant(x) = ±2^k|x.significand|

where:


1 <= |x.significand| < 2


Unbiased exponent k depends on the interval range defined by interv and whether the exponent of the source is even or odd. The sign of the final result is determined by sc and the source sign.

If interv != 0 then k = -1, otherwise k = 0. The encoded value of imm8[1:0] and the sign control are shown in Figure 5-15.

Each converted SP FP result is encoded according to the sign control, the unbiased exponent k (adding bias) and a mantissa normalized to the range specified by interv.

The GetMant() function follows Table 5-7 when dealing with floating-point special numbers.

This instruction is writemasked, so only those elements with the corresponding bit set in vector mask register k1 are computed and stored into the destination. Elements in zmm1 with the corresponding bit clear in k1 retain their previous values.

Note: EVEX.vvvv is reserved and must be 1111b, VEX.L must be 0; otherwise instructions will #UD.



Operation

GetNormalizeMantissaSP(SRC[31:0], SignCtrl[1:0], Interv[1:0])
{
    // Extract the SRC sign, exponent and mantissa fields
    Dst.sign ← SignCtrl[0] ? 0 : Src[31]; // Get sign bit
    Dst.exp ← SRC[30:23]; // Get original exponent value
    Dst.fraction ← SRC[22:0]; // Get original fraction value
    ZeroOperand ← (Dst.exp = 0) AND (Dst.fraction = 0);
    DenormOperand ← (Dst.exp = 0h) AND (Dst.fraction != 0);
    InfiniteOperand ← (Dst.exp = 0FFh) AND (Dst.fraction = 0);
    NaNOperand ← (Dst.exp = 0FFh) AND (Dst.fraction != 0);

    // Check for NaN operand
    IF (NaNOperand)
    {
        IF (SRC = SNaN) {Set #IE;}
        Return QNAN(SRC);
    }
    // Check for Zero and Infinite operands
    IF (ZeroOperand OR InfiniteOperand)
    {
        Dst.exp ← 07Fh; // Override exponent with BIAS
        Return ((Dst.sign << 31) | (Dst.exp << 23) | (Dst.fraction));
    }
    // Check for negative operand (including -0.0)
    IF ((Src[31] = 1) AND SignCtrl[1])
    {
        Set #IE;
        Return QNaN_Indefinite;
    }
    // Check for denormal operand
    IF (DenormOperand)
    {
        IF (MXCSR.DAZ = 1)
            Dst.fraction ← 0; // Zero out fraction
        ELSE
        {
            // Jbit is the hidden integral bit. Zero in case of denormal operand.
            Src.Jbit ← 0; // Zero Src Jbit
            Dst.exp ← 07Fh; // Override exponent with BIAS
            WHILE (Src.Jbit = 0) { // normalize mantissa
                Src.Jbit ← Dst.fraction[22]; // Get the fraction MSB
                Dst.fraction ← (Dst.fraction << 1); // Start normalizing the mantissa
                Dst.exp--; // Adjust the exponent
            }
            SET #DE; // Set DE bit
        }
    } // At this point, Dst.fraction is normalized.
    // Check the exponent response
    Unbiased.exp ← Dst.exp - 07Fh; // subtract the bias from exponent
    IsOddExp ← Unbiased.exp[0]; // recognize unbiased ODD exponent
    SignalingBit ← Dst.fraction[22];
    CASE (interv[1:0]) OF
        00: Dst.exp ← 07Fh; // This is the bias
        01: Dst.exp ← (IsOddExp) ? 07Eh : 07Fh; // either bias-1, or bias
        10: Dst.exp ← 07Eh; // bias-1
        11: Dst.exp ← (SignalingBit) ? 07Eh : 07Fh; // either bias-1, or bias
    ESAC

    // Form the final destination
    DEST[31:0] ← (Dst.sign << 31) OR (Dst.exp << 23) OR (Dst.fraction);





Return (DEST);

}


VGETMANTPS (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
SignCtrl[1:0] ← IMM8[3:2];
Interv[1:0] ← IMM8[1:0];
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b = 1) AND (SRC *is memory*)
            THEN DEST[i+31:i] ← GetNormalizedMantissaSP(SRC[31:0], SignCtrl, Interv)
            ELSE DEST[i+31:i] ← GetNormalizedMantissaSP(SRC[i+31:i], SignCtrl, Interv)
        FI;
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+31:i] remains unchanged*
        ELSE ; zeroing-masking
            DEST[i+31:i] ← 0
        FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


Intel C/C++ Compiler Intrinsic Equivalent

VGETMANTPS __m512 _mm512_getmant_ps( __m512 a, enum intv, enum sgn);
VGETMANTPS __m512 _mm512_mask_getmant_ps( __m512 s, __mmask16 k, __m512 a, enum intv, enum sgn);
VGETMANTPS __m512 _mm512_maskz_getmant_ps( __mmask16 k, __m512 a, enum intv, enum sgn);
VGETMANTPS __m512 _mm512_getmant_round_ps( __m512 a, enum intv, enum sgn, int r);
VGETMANTPS __m512 _mm512_mask_getmant_round_ps( __m512 s, __mmask16 k, __m512 a, enum intv, enum sgn, int r);
VGETMANTPS __m512 _mm512_maskz_getmant_round_ps( __mmask16 k, __m512 a, enum intv, enum sgn, int r);
VGETMANTPS __m256 _mm256_getmant_ps( __m256 a, enum intv, enum sgn);
VGETMANTPS __m256 _mm256_mask_getmant_ps( __m256 s, __mmask8 k, __m256 a, enum intv, enum sgn);
VGETMANTPS __m256 _mm256_maskz_getmant_ps( __mmask8 k, __m256 a, enum intv, enum sgn);
VGETMANTPS __m128 _mm_getmant_ps( __m128 a, enum intv, enum sgn);
VGETMANTPS __m128 _mm_mask_getmant_ps( __m128 s, __mmask8 k, __m128 a, enum intv, enum sgn);
VGETMANTPS __m128 _mm_maskz_getmant_ps( __mmask8 k, __m128 a, enum intv, enum sgn);


SIMD Floating-Point Exceptions

Denormal, Invalid


Other Exceptions

See Exceptions Type E2.

#UD If EVEX.vvvv != 1111B.


VGETMANTSD—Extract Float64 of Normalized Mantissas from Float64 Scalar

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description

EVEX.NDS.LIG.66.0F3A.W1 27 /r ib
VGETMANTSD xmm1 {k1}{z}, xmm2, xmm3/m64{sae}, imm8
A | V/V | AVX512F
Extract the normalized mantissa of the low float64 element in xmm3/m64 using imm8 for sign control and mantissa interval normalization. Store the mantissa to xmm1 under the writemask k1 and merge with the other elements of xmm2.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | Tuple1 Scalar | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | Imm8

Description

Convert the double-precision floating-point value in the low quadword element of the second source operand (the third operand) to a DP FP value with the mantissa normalization and sign control specified by the imm8 byte, see Figure 5-15. The converted result is written to the low quadword element of the destination operand (the first operand) using writemask k1. Bits (127:64) of the XMM register destination are copied from the corresponding bits in the first source operand. The normalized mantissa is specified by interv (imm8[1:0]) and the sign control (sc) is specified by bits 3:2 of the immediate byte.

The conversion operation is:

GetMant(x) = ±2^k|x.significand|

where:


1 <= |x.significand| < 2


Unbiased exponent k depends on the interval range defined by interv and whether the exponent of the source is even or odd. The sign of the final result is determined by sc and the source sign.

If interv != 0 then k = -1, otherwise k = 0. The encoded value of imm8[1:0] and the sign control are shown in Figure 5-15.

The converted DP FP result is encoded according to the sign control, the unbiased exponent k (adding bias) and a mantissa normalized to the range specified by interv.

The GetMant() function follows Table 5-7 when dealing with floating-point special numbers.

This instruction is writemasked, so only those elements with the corresponding bit set in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with the corresponding bit clear in k1 retain their previous values.



Operation

// GetNormalizeMantissaDP(SRC[63:0], SignCtrl[1:0], Interv[1:0]) is defined in the operation section of VGETMANTPD


VGETMANTSD (EVEX encoded version)
SignCtrl[1:0] ← IMM8[3:2];
Interv[1:0] ← IMM8[1:0];
IF k1[0] OR *no writemask*
    THEN DEST[63:0] ← GetNormalizedMantissaDP(SRC2[63:0], SignCtrl, Interv)
ELSE
    IF *merging-masking* ; merging-masking
        THEN *DEST[63:0] remains unchanged*
    ELSE ; zeroing-masking
        DEST[63:0] ← 0
    FI
FI;
DEST[127:64] ← SRC1[127:64]
DEST[MAXVL-1:128] ← 0


Intel C/C++ Compiler Intrinsic Equivalent

VGETMANTSD __m128d _mm_getmant_sd( __m128d a, __m128d b, enum intv, enum sgn);
VGETMANTSD __m128d _mm_mask_getmant_sd( __m128d s, __mmask8 k, __m128d a, __m128d b, enum intv, enum sgn);
VGETMANTSD __m128d _mm_maskz_getmant_sd( __mmask8 k, __m128d a, __m128d b, enum intv, enum sgn);
VGETMANTSD __m128d _mm_getmant_round_sd( __m128d a, __m128d b, enum intv, enum sgn, int r);
VGETMANTSD __m128d _mm_mask_getmant_round_sd( __m128d s, __mmask8 k, __m128d a, __m128d b, enum intv, enum sgn, int r);
VGETMANTSD __m128d _mm_maskz_getmant_round_sd( __mmask8 k, __m128d a, __m128d b, enum intv, enum sgn, int r);


SIMD Floating-Point Exceptions

Denormal, Invalid


Other Exceptions

See Exceptions Type E3.


VGETMANTSS—Extract Float32 Vector of Normalized Mantissa from Float32 Vector

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description

EVEX.NDS.LIG.66.0F3A.W0 27 /r ib
VGETMANTSS xmm1 {k1}{z}, xmm2, xmm3/m32{sae}, imm8
A | V/V | AVX512F
Extract the normalized mantissa from the low float32 element of xmm3/m32 using imm8 for sign control and mantissa interval normalization, store the mantissa to xmm1 under the writemask k1 and merge with the other elements of xmm2.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | Tuple1 Scalar | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | Imm8

Description

Convert the single-precision floating-point value in the low doubleword element of the second source operand (the third operand) to an SP FP value with the mantissa normalization and sign control specified by the imm8 byte, see Figure 5-15. The converted result is written to the low doubleword element of the destination operand (the first operand) using writemask k1. Bits (127:32) of the XMM register destination are copied from the corresponding bits in the first source operand. The normalized mantissa is specified by interv (imm8[1:0]) and the sign control (sc) is specified by bits 3:2 of the immediate byte.

The conversion operation is:

GetMant(x) = ±2^k|x.significand|

where:


1 <= |x.significand| < 2


Unbiased exponent k depends on the interval range defined by interv and whether the exponent of the source is even or odd. The sign of the final result is determined by sc and the source sign.

If interv != 0 then k = -1, otherwise k = 0. The encoded value of imm8[1:0] and the sign control are shown in Figure 5-15.

The converted SP FP result is encoded according to the sign control, the unbiased exponent k (adding bias) and a mantissa normalized to the range specified by interv.

The GetMant() function follows Table 5-7 when dealing with floating-point special numbers.

This instruction is writemasked, so only those elements with the corresponding bit set in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with the corresponding bit clear in k1 retain their previous values.



Operation

// GetNormalizeMantissaSP(SRC[31:0], SignCtrl[1:0], Interv[1:0]) is defined in the operation section of VGETMANTPS


VGETMANTSS (EVEX encoded version)
SignCtrl[1:0] ← IMM8[3:2];
Interv[1:0] ← IMM8[1:0];
IF k1[0] OR *no writemask*
    THEN DEST[31:0] ← GetNormalizedMantissaSP(SRC2[31:0], SignCtrl, Interv)
ELSE
    IF *merging-masking* ; merging-masking
        THEN *DEST[31:0] remains unchanged*
    ELSE ; zeroing-masking
        DEST[31:0] ← 0
    FI
FI;
DEST[127:32] ← SRC1[127:32]
DEST[MAXVL-1:128] ← 0


Intel C/C++ Compiler Intrinsic Equivalent

VGETMANTSS __m128 _mm_getmant_ss( __m128 a, __m128 b, enum intv, enum sgn);
VGETMANTSS __m128 _mm_mask_getmant_ss( __m128 s, __mmask8 k, __m128 a, __m128 b, enum intv, enum sgn);
VGETMANTSS __m128 _mm_maskz_getmant_ss( __mmask8 k, __m128 a, __m128 b, enum intv, enum sgn);
VGETMANTSS __m128 _mm_getmant_round_ss( __m128 a, __m128 b, enum intv, enum sgn, int r);
VGETMANTSS __m128 _mm_mask_getmant_round_ss( __m128 s, __mmask8 k, __m128 a, __m128 b, enum intv, enum sgn, int r);
VGETMANTSS __m128 _mm_maskz_getmant_round_ss( __mmask8 k, __m128 a, __m128 b, enum intv, enum sgn, int r);


SIMD Floating-Point Exceptions

Denormal, Invalid


Other Exceptions

See Exceptions Type E3.


VINSERTF128/VINSERTF32x4/VINSERTF64x2/VINSERTF32x8/VINSERTF64x4—Insert Packed

Floating-Point Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description

VEX.NDS.256.66.0F3A.W0 18 /r ib
VINSERTF128 ymm1, ymm2, xmm3/m128, imm8
A | V/V | AVX
Insert 128 bits of packed floating-point values from xmm3/m128 and the remaining values from ymm2 into ymm1.

EVEX.NDS.256.66.0F3A.W0 18 /r ib
VINSERTF32X4 ymm1 {k1}{z}, ymm2, xmm3/m128, imm8
C | V/V | AVX512VL AVX512F
Insert 128 bits of packed single-precision floating-point values from xmm3/m128 and the remaining values from ymm2 into ymm1 under writemask k1.

EVEX.NDS.512.66.0F3A.W0 18 /r ib
VINSERTF32X4 zmm1 {k1}{z}, zmm2, xmm3/m128, imm8
C | V/V | AVX512F
Insert 128 bits of packed single-precision floating-point values from xmm3/m128 and the remaining values from zmm2 into zmm1 under writemask k1.

EVEX.NDS.256.66.0F3A.W1 18 /r ib
VINSERTF64X2 ymm1 {k1}{z}, ymm2, xmm3/m128, imm8
B | V/V | AVX512VL AVX512DQ
Insert 128 bits of packed double-precision floating-point values from xmm3/m128 and the remaining values from ymm2 into ymm1 under writemask k1.

EVEX.NDS.512.66.0F3A.W1 18 /r ib
VINSERTF64X2 zmm1 {k1}{z}, zmm2, xmm3/m128, imm8
B | V/V | AVX512DQ
Insert 128 bits of packed double-precision floating-point values from xmm3/m128 and the remaining values from zmm2 into zmm1 under writemask k1.

EVEX.NDS.512.66.0F3A.W0 1A /r ib
VINSERTF32X8 zmm1 {k1}{z}, zmm2, ymm3/m256, imm8
D | V/V | AVX512DQ
Insert 256 bits of packed single-precision floating-point values from ymm3/m256 and the remaining values from zmm2 into zmm1 under writemask k1.

EVEX.NDS.512.66.0F3A.W1 1A /r ib
VINSERTF64X4 zmm1 {k1}{z}, zmm2, ymm3/m256, imm8
C | V/V | AVX512F
Insert 256 bits of packed double-precision floating-point values from ymm3/m256 and the remaining values from zmm2 into zmm1 under writemask k1.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | NA | ModRM:reg (w) | VEX.vvvv | ModRM:r/m (r) | Imm8
B | Tuple2 | ModRM:reg (w) | EVEX.vvvv | ModRM:r/m (r) | Imm8
C | Tuple4 | ModRM:reg (w) | EVEX.vvvv | ModRM:r/m (r) | Imm8
D | Tuple8 | ModRM:reg (w) | EVEX.vvvv | ModRM:r/m (r) | Imm8

Description

VINSERTF128/VINSERTF32x4 and VINSERTF64x2 insert 128 bits of packed floating-point values from the second source operand (the third operand) into the destination operand (the first operand) at a 128-bit granular offset multiplied by imm8[0] (256-bit destination) or imm8[1:0] (512-bit destination). The remaining portions of the destination operand are copied from the corresponding fields of the first source operand (the second operand). The second source operand can be either an XMM register or a 128-bit memory location. The destination and first source operands are vector registers.

VINSERTF32x4: The destination operand is a ZMM/YMM register and updated at 32-bit granularity according to the writemask. The high 6/7 bits of the immediate are ignored.

VINSERTF64x2: The destination operand is a ZMM/YMM register and updated at 64-bit granularity according to the writemask. The high 6/7 bits of the immediate are ignored.

VINSERTF32x8 and VINSERTF64x4 insert 256 bits of packed floating-point values from the second source operand (the third operand) into the destination operand (the first operand) at a 256-bit granular offset multiplied by imm8[0]. The remaining portions of the destination are copied from the corresponding fields of the first source operand (the second operand). The second source operand can be either a YMM register or a 256-bit memory location. The high 7 bits of the immediate are ignored. The destination operand is a ZMM register and updated at 32/64-bit granularity according to the writemask.
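The lane-insertion step these instructions share (before any masking is applied) can be sketched in Python on raw byte strings. The function name and parameters are illustrative, not part of any API; note how the high immediate bits are ignored, as described above:

```python
def insert_lane(dst: bytes, src: bytes, imm8: int, lane_bytes: int = 16) -> bytes:
    """Sketch of the VINSERTF128/VINSERTF32x4-style lane insert: copy the
    first source operand, then overwrite one lane_bytes-sized lane selected
    by the low bits of imm8 with the second source operand."""
    lanes = len(dst) // lane_bytes
    sel = imm8 & (lanes - 1)      # only the low immediate bits select a lane
    out = bytearray(dst)
    out[sel * lane_bytes:(sel + 1) * lane_bytes] = src[:lane_bytes]
    return bytes(out)
```

With a 32-byte (256-bit) destination there are two lanes and only imm8[0] matters; with a 64-byte (512-bit) destination there are four lanes and imm8[1:0] select among them, matching the CASE blocks in the Operation section.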



Operation

VINSERTF32x4 (EVEX encoded versions)
(KL, VL) = (8, 256), (16, 512)
TMP_DEST[VL-1:0] ← SRC1[VL-1:0]
IF VL = 256
    CASE (imm8[0]) OF
        0: TMP_DEST[127:0] ← SRC2[127:0]
        1: TMP_DEST[255:128] ← SRC2[127:0]
    ESAC.
FI;
IF VL = 512
    CASE (imm8[1:0]) OF
        00: TMP_DEST[127:0] ← SRC2[127:0]
        01: TMP_DEST[255:128] ← SRC2[127:0]
        10: TMP_DEST[383:256] ← SRC2[127:0]
        11: TMP_DEST[511:384] ← SRC2[127:0]
    ESAC.
FI;
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← TMP_DEST[i+31:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
            ELSE ; zeroing-masking
                DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


VINSERTF64x2 (EVEX encoded versions)
(KL, VL) = (4, 256), (8, 512)
TMP_DEST[VL-1:0] ← SRC1[VL-1:0]
IF VL = 256
    CASE (imm8[0]) OF
        0: TMP_DEST[127:0] ← SRC2[127:0]
        1: TMP_DEST[255:128] ← SRC2[127:0]
    ESAC.
FI;
IF VL = 512
    CASE (imm8[1:0]) OF
        00: TMP_DEST[127:0] ← SRC2[127:0]
        01: TMP_DEST[255:128] ← SRC2[127:0]
        10: TMP_DEST[383:256] ← SRC2[127:0]
        11: TMP_DEST[511:384] ← SRC2[127:0]
    ESAC.
FI;
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← TMP_DEST[i+63:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
            ELSE ; zeroing-masking
                DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


VINSERTF32x8 (EVEX.U1.512 encoded version)
TMP_DEST[VL-1:0] ← SRC1[VL-1:0]
CASE (imm8[0]) OF
    0: TMP_DEST[255:0] ← SRC2[255:0]
    1: TMP_DEST[511:256] ← SRC2[255:0]
ESAC.

FOR j ← 0 TO 15
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← TMP_DEST[i+31:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
            ELSE ; zeroing-masking
                DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


VINSERTF64x4 (EVEX.512 encoded version)
VL = 512
TMP_DEST[VL-1:0] ← SRC1[VL-1:0]
CASE (imm8[0]) OF
    0: TMP_DEST[255:0] ← SRC2[255:0]
    1: TMP_DEST[511:256] ← SRC2[255:0]
ESAC.

FOR j ← 0 TO 7
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← TMP_DEST[i+63:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
            ELSE ; zeroing-masking
                DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0



VINSERTF128 (VEX encoded version)
TEMP[255:0] ← SRC1[255:0]
CASE (imm8[0]) OF
    0: TEMP[127:0] ← SRC2[127:0]
    1: TEMP[255:128] ← SRC2[127:0]
ESAC
DEST ← TEMP


Intel C/C++ Compiler Intrinsic Equivalent

VINSERTF32x4 __m512 _mm512_insertf32x4( __m512 a, __m128 b, int imm);
VINSERTF32x4 __m512 _mm512_mask_insertf32x4( __m512 s, __mmask16 k, __m512 a, __m128 b, int imm);
VINSERTF32x4 __m512 _mm512_maskz_insertf32x4( __mmask16 k, __m512 a, __m128 b, int imm);
VINSERTF32x4 __m256 _mm256_insertf32x4( __m256 a, __m128 b, int imm);
VINSERTF32x4 __m256 _mm256_mask_insertf32x4( __m256 s, __mmask8 k, __m256 a, __m128 b, int imm);
VINSERTF32x4 __m256 _mm256_maskz_insertf32x4( __mmask8 k, __m256 a, __m128 b, int imm);
VINSERTF32x8 __m512 _mm512_insertf32x8( __m512 a, __m256 b, int imm);
VINSERTF32x8 __m512 _mm512_mask_insertf32x8( __m512 s, __mmask16 k, __m512 a, __m256 b, int imm);
VINSERTF32x8 __m512 _mm512_maskz_insertf32x8( __mmask16 k, __m512 a, __m256 b, int imm);
VINSERTF64x2 __m512d _mm512_insertf64x2( __m512d a, __m128d b, int imm);
VINSERTF64x2 __m512d _mm512_mask_insertf64x2( __m512d s, __mmask8 k, __m512d a, __m128d b, int imm);
VINSERTF64x2 __m512d _mm512_maskz_insertf64x2( __mmask8 k, __m512d a, __m128d b, int imm);
VINSERTF64x2 __m256d _mm256_insertf64x2( __m256d a, __m128d b, int imm);
VINSERTF64x2 __m256d _mm256_mask_insertf64x2( __m256d s, __mmask8 k, __m256d a, __m128d b, int imm);
VINSERTF64x2 __m256d _mm256_maskz_insertf64x2( __mmask8 k, __m256d a, __m128d b, int imm);
VINSERTF64x4 __m512d _mm512_insertf64x4( __m512d a, __m256d b, int imm);
VINSERTF64x4 __m512d _mm512_mask_insertf64x4( __m512d s, __mmask8 k, __m512d a, __m256d b, int imm);
VINSERTF64x4 __m512d _mm512_maskz_insertf64x4( __mmask8 k, __m512d a, __m256d b, int imm);
VINSERTF128 __m256 _mm256_insertf128_ps ( __m256 a, __m128 b, int offset);
VINSERTF128 __m256d _mm256_insertf128_pd ( __m256d a, __m128d b, int offset);
VINSERTF128 __m256i _mm256_insertf128_si256 ( __m256i a, __m128i b, int offset);


SIMD Floating-Point Exceptions

None


Other Exceptions

VEX-encoded instruction, see Exceptions Type 6; additionally

#UD If VEX.L = 0.

EVEX-encoded instruction, see Exceptions Type E6NF.


VINSERTI128/VINSERTI32x4/VINSERTI64x2/VINSERTI32x8/VINSERTI64x4—Insert Packed

Integer Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description

VEX.NDS.256.66.0F3A.W0 38 /r ib
VINSERTI128 ymm1, ymm2, xmm3/m128, imm8
A | V/V | AVX2
Insert 128 bits of integer data from xmm3/m128 and the remaining values from ymm2 into ymm1.

EVEX.NDS.256.66.0F3A.W0 38 /r ib
VINSERTI32X4 ymm1 {k1}{z}, ymm2, xmm3/m128, imm8
C | V/V | AVX512VL AVX512F
Insert 128 bits of packed doubleword integer values from xmm3/m128 and the remaining values from ymm2 into ymm1 under writemask k1.

EVEX.NDS.512.66.0F3A.W0 38 /r ib
VINSERTI32X4 zmm1 {k1}{z}, zmm2, xmm3/m128, imm8
C | V/V | AVX512F
Insert 128 bits of packed doubleword integer values from xmm3/m128 and the remaining values from zmm2 into zmm1 under writemask k1.

EVEX.NDS.256.66.0F3A.W1 38 /r ib
VINSERTI64X2 ymm1 {k1}{z}, ymm2, xmm3/m128, imm8
B | V/V | AVX512VL AVX512DQ
Insert 128 bits of packed quadword integer values from xmm3/m128 and the remaining values from ymm2 into ymm1 under writemask k1.

EVEX.NDS.512.66.0F3A.W1 38 /r ib
VINSERTI64X2 zmm1 {k1}{z}, zmm2, xmm3/m128, imm8
B | V/V | AVX512DQ
Insert 128 bits of packed quadword integer values from xmm3/m128 and the remaining values from zmm2 into zmm1 under writemask k1.

EVEX.NDS.512.66.0F3A.W0 3A /r ib
VINSERTI32X8 zmm1 {k1}{z}, zmm2, ymm3/m256, imm8
D | V/V | AVX512DQ
Insert 256 bits of packed doubleword integer values from ymm3/m256 and the remaining values from zmm2 into zmm1 under writemask k1.

EVEX.NDS.512.66.0F3A.W1 3A /r ib
VINSERTI64X4 zmm1 {k1}{z}, zmm2, ymm3/m256, imm8
C | V/V | AVX512F
Insert 256 bits of packed quadword integer values from ymm3/m256 and the remaining values from zmm2 into zmm1 under writemask k1.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | NA | ModRM:reg (w) | VEX.vvvv | ModRM:r/m (r) | Imm8
B | Tuple2 | ModRM:reg (w) | EVEX.vvvv | ModRM:r/m (r) | Imm8
C | Tuple4 | ModRM:reg (w) | EVEX.vvvv | ModRM:r/m (r) | Imm8
D | Tuple8 | ModRM:reg (w) | EVEX.vvvv | ModRM:r/m (r) | Imm8

Description

VINSERTI32x4 and VINSERTI64x2 inserts 128-bits of packed integer values from the second source operand (the third operand) into the destination operand (the first operand) at an 128-bit granular offset multiplied by imm8[0] (256-bit) or imm8[1:0]. The remaining portions of the destination are copied from the corresponding fields of the first source operand (the second operand). The second source operand can be either an XMM register or a 128-bit memory location. The high 6/7bits of the immediate are ignored. The destination operand is a ZMM/YMM register and updated at 32 and 64-bit granularity according to the writemask.

VINSERTI32x8 and VINSERTI64x4 inserts 256-bits of packed integer values from the second source operand (the third operand) into the destination operand (the first operand) at a 256-bit granular offset multiplied by imm8[0]. The remaining portions of the destination are copied from the corresponding fields of the first source operand (the second operand). The second source operand can be either an YMM register or a 256-bit memory location. The upper bits of the immediate are ignored. The destination operand is a ZMM register and updated at 32 and 64-bit granularity according to the writemask.

VINSERTI128 inserts 128 bits of packed integer data from the second source operand (the third operand) into the destination operand (the first operand) at a 128-bit granular offset selected by imm8[0]. The remaining portions of the destination are copied from the corresponding fields of the first source operand (the second operand). The second source operand can be either an XMM register or a 128-bit memory location. The high 7 bits of the immediate are ignored. VEX.L must be 1; attempting to execute this instruction with VEX.L = 0 causes #UD.
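The lane-selection step described above can be sketched in plain C. This is an illustrative model of the pseudocode, not the hardware implementation; the function name and the dword-array operand layout are invented here for the sketch.

```c
#include <stdint.h>
#include <string.h>

/* Model of VINSERTI128: dst = src1 with the 128-bit lane selected by
   imm8 bit 0 replaced by src2. src1/dst hold 8 dwords (256 bits),
   src2 holds 4 dwords (128 bits). The high 7 immediate bits are ignored. */
static void vinserti128_model(uint32_t dst[8], const uint32_t src1[8],
                              const uint32_t src2[4], int imm8)
{
    memcpy(dst, src1, 32);                   /* DEST <- SRC1          */
    memcpy(dst + (imm8 & 1) * 4, src2, 16);  /* selected lane <- SRC2 */
}
```

With imm8 = 1 the upper lane is replaced; any immediate with bit 0 clear selects the low lane, matching the "high 7 bits ignored" rule above.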



Operation

VINSERTI32x4 (EVEX encoded versions)
(KL, VL) = (8, 256), (16, 512)
TMP_DEST[VL-1:0] ← SRC1[VL-1:0]
IF VL = 256
    CASE (imm8[0]) OF
        0: TMP_DEST[127:0] ← SRC2[127:0]
        1: TMP_DEST[255:128] ← SRC2[127:0]
    ESAC.
FI;
IF VL = 512
    CASE (imm8[1:0]) OF
        00: TMP_DEST[127:0] ← SRC2[127:0]
        01: TMP_DEST[255:128] ← SRC2[127:0]
        10: TMP_DEST[383:256] ← SRC2[127:0]
        11: TMP_DEST[511:384] ← SRC2[127:0]
    ESAC.
FI;
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← TMP_DEST[i+31:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


VINSERTI64x2 (EVEX encoded versions)
(KL, VL) = (4, 256), (8, 512)
TMP_DEST[VL-1:0] ← SRC1[VL-1:0]
IF VL = 256
    CASE (imm8[0]) OF
        0: TMP_DEST[127:0] ← SRC2[127:0]
        1: TMP_DEST[255:128] ← SRC2[127:0]
    ESAC.
FI;
IF VL = 512
    CASE (imm8[1:0]) OF
        00: TMP_DEST[127:0] ← SRC2[127:0]
        01: TMP_DEST[255:128] ← SRC2[127:0]
        10: TMP_DEST[383:256] ← SRC2[127:0]
        11: TMP_DEST[511:384] ← SRC2[127:0]
    ESAC.
FI;
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← TMP_DEST[i+63:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


VINSERTI32x8 (EVEX.U1.512 encoded version)
TMP_DEST[VL-1:0] ← SRC1[VL-1:0]
CASE (imm8[0]) OF
    0: TMP_DEST[255:0] ← SRC2[255:0]
    1: TMP_DEST[511:256] ← SRC2[255:0]
ESAC.
FOR j ← 0 TO 15
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← TMP_DEST[i+31:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


VINSERTI64x4 (EVEX.512 encoded version)
VL = 512
TMP_DEST[VL-1:0] ← SRC1[VL-1:0]
CASE (imm8[0]) OF
    0: TMP_DEST[255:0] ← SRC2[255:0]
    1: TMP_DEST[511:256] ← SRC2[255:0]
ESAC.
FOR j ← 0 TO 7
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← TMP_DEST[i+63:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0



VINSERTI128
TEMP[255:0] ← SRC1[255:0]
CASE (imm8[0]) OF
    0: TEMP[127:0] ← SRC2[127:0]
    1: TEMP[255:128] ← SRC2[127:0]
ESAC
DEST ← TEMP


Intel C/C++ Compiler Intrinsic Equivalent

VINSERTI32x4 __m512i _mm512_inserti32x4(__m512i a, __m128i b, int imm);
VINSERTI32x4 __m512i _mm512_mask_inserti32x4(__m512i s, __mmask16 k, __m512i a, __m128i b, int imm);
VINSERTI32x4 __m512i _mm512_maskz_inserti32x4(__mmask16 k, __m512i a, __m128i b, int imm);
VINSERTI32x4 __m256i _mm256_inserti32x4(__m256i a, __m128i b, int imm);
VINSERTI32x4 __m256i _mm256_mask_inserti32x4(__m256i s, __mmask8 k, __m256i a, __m128i b, int imm);
VINSERTI32x4 __m256i _mm256_maskz_inserti32x4(__mmask8 k, __m256i a, __m128i b, int imm);
VINSERTI32x8 __m512i _mm512_inserti32x8(__m512i a, __m256i b, int imm);
VINSERTI32x8 __m512i _mm512_mask_inserti32x8(__m512i s, __mmask16 k, __m512i a, __m256i b, int imm);
VINSERTI32x8 __m512i _mm512_maskz_inserti32x8(__mmask16 k, __m512i a, __m256i b, int imm);
VINSERTI64x2 __m512i _mm512_inserti64x2(__m512i a, __m128i b, int imm);
VINSERTI64x2 __m512i _mm512_mask_inserti64x2(__m512i s, __mmask8 k, __m512i a, __m128i b, int imm);
VINSERTI64x2 __m512i _mm512_maskz_inserti64x2(__mmask8 k, __m512i a, __m128i b, int imm);
VINSERTI64x2 __m256i _mm256_inserti64x2(__m256i a, __m128i b, int imm);
VINSERTI64x2 __m256i _mm256_mask_inserti64x2(__m256i s, __mmask8 k, __m256i a, __m128i b, int imm);
VINSERTI64x2 __m256i _mm256_maskz_inserti64x2(__mmask8 k, __m256i a, __m128i b, int imm);
VINSERTI64x4 __m512i _mm512_inserti64x4(__m512i a, __m256i b, int imm);
VINSERTI64x4 __m512i _mm512_mask_inserti64x4(__m512i s, __mmask8 k, __m512i a, __m256i b, int imm);
VINSERTI64x4 __m512i _mm512_maskz_inserti64x4(__mmask8 k, __m512i a, __m256i b, int imm);
VINSERTI128 __m256i _mm256_inserti128_si256(__m256i a, __m128i b, int imm);


SIMD Floating-Point Exceptions

None


Other Exceptions

VEX-encoded instruction, see Exceptions Type 6; additionally

#UD If VEX.L = 0.

EVEX-encoded instruction, see Exceptions Type E6NF.


VMASKMOV—Conditional SIMD Packed Loads and Stores

Opcode/Instruction | Op/En | 64/32-bit Mode | CPUID Feature Flag | Description
VEX.NDS.128.66.0F38.W0 2C /r VMASKMOVPS xmm1, xmm2, m128 | RVM | V/V | AVX | Conditionally load packed single-precision values from m128 using mask in xmm2 and store in xmm1.
VEX.NDS.256.66.0F38.W0 2C /r VMASKMOVPS ymm1, ymm2, m256 | RVM | V/V | AVX | Conditionally load packed single-precision values from m256 using mask in ymm2 and store in ymm1.
VEX.NDS.128.66.0F38.W0 2D /r VMASKMOVPD xmm1, xmm2, m128 | RVM | V/V | AVX | Conditionally load packed double-precision values from m128 using mask in xmm2 and store in xmm1.
VEX.NDS.256.66.0F38.W0 2D /r VMASKMOVPD ymm1, ymm2, m256 | RVM | V/V | AVX | Conditionally load packed double-precision values from m256 using mask in ymm2 and store in ymm1.
VEX.NDS.128.66.0F38.W0 2E /r VMASKMOVPS m128, xmm1, xmm2 | MVR | V/V | AVX | Conditionally store packed single-precision values from xmm2 using mask in xmm1.
VEX.NDS.256.66.0F38.W0 2E /r VMASKMOVPS m256, ymm1, ymm2 | MVR | V/V | AVX | Conditionally store packed single-precision values from ymm2 using mask in ymm1.
VEX.NDS.128.66.0F38.W0 2F /r VMASKMOVPD m128, xmm1, xmm2 | MVR | V/V | AVX | Conditionally store packed double-precision values from xmm2 using mask in xmm1.
VEX.NDS.256.66.0F38.W0 2F /r VMASKMOVPD m256, ymm1, ymm2 | MVR | V/V | AVX | Conditionally store packed double-precision values from ymm2 using mask in ymm1.


Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
MVR | ModRM:r/m (w) | VEX.vvvv (r) | ModRM:reg (r) | NA


Description

Conditionally moves packed data elements from the second source operand into the corresponding data element of the destination operand, depending on the mask bits associated with each data element. The mask bits are specified in the first source operand.

The mask bit for each data element is the most significant bit of that element in the first source operand. If a mask is 1, the corresponding data element is copied from the second source operand to the destination operand. If the mask is 0, the corresponding data element is set to zero in the load form of these instructions, and unmodified in the store form.

The second source operand is a memory address for the load form of these instructions. The destination operand is a memory address for the store form of these instructions. The other operands are both XMM registers (for the VEX.128 version) or YMM registers (for the VEX.256 version).

Faults occur only for the memory accesses that mask bits actually require. No fault occurs from referencing a memory location whose corresponding mask bit is 0; for example, no faults are detected if the mask bits are all zero.

Unlike previous MASKMOV instructions (MASKMOVQ and MASKMOVDQU), a nontemporal hint is not applied to these instructions.

Instruction behavior on alignment check reporting with mask bits of less than all 1s is the same as with mask bits of all 1s.

VMASKMOV should not be used to access memory-mapped I/O or uncached memory, as the access and the ordering of the individual loads or stores it performs are implementation specific.



In cases where the mask bits indicate that data should not be loaded or stored, the paging A and D bits will be set in an implementation-dependent way. However, the A and D bits are always set for pages where data is actually loaded or stored.

Note: for load forms, the first source (the mask) is encoded in VEX.vvvv; the second source is encoded in rm_field, and the destination register is encoded in reg_field.

Note: for store forms, the first source (the mask) is encoded in VEX.vvvv; the second source register is encoded in reg_field, and the destination memory location is encoded in rm_field.
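The mask-selection rule for the load form can be modeled in plain C. This is an illustrative sketch only (the function name and array operands are invented); it models the element selection but not the fault-suppression behavior described above.

```c
#include <stdint.h>

/* Model of the VMASKMOVPS 128-bit load form: element i is loaded only if
   the most significant bit of mask element i is set; otherwise it is
   zeroed. The mask elements are signed so bit 31 is the sign bit. */
static void vmaskmovps_load_model(uint32_t dst[4], const int32_t mask[4],
                                  const uint32_t mem[4])
{
    for (int i = 0; i < 4; i++)
        dst[i] = (mask[i] < 0) ? mem[i] : 0;  /* MSB set -> load, else 0 */
}
```

The store form is symmetric: elements whose mask MSB is clear are simply left unmodified in memory rather than zeroed.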


Operation

VMASKMOVPS - 128-bit load
DEST[31:0] ← IF (SRC1[31]) Load_32(mem) ELSE 0
DEST[63:32] ← IF (SRC1[63]) Load_32(mem + 4) ELSE 0
DEST[95:64] ← IF (SRC1[95]) Load_32(mem + 8) ELSE 0
DEST[127:96] ← IF (SRC1[127]) Load_32(mem + 12) ELSE 0
DEST[MAXVL-1:128] ← 0


VMASKMOVPS - 256-bit load
DEST[31:0] ← IF (SRC1[31]) Load_32(mem) ELSE 0
DEST[63:32] ← IF (SRC1[63]) Load_32(mem + 4) ELSE 0
DEST[95:64] ← IF (SRC1[95]) Load_32(mem + 8) ELSE 0
DEST[127:96] ← IF (SRC1[127]) Load_32(mem + 12) ELSE 0
DEST[159:128] ← IF (SRC1[159]) Load_32(mem + 16) ELSE 0
DEST[191:160] ← IF (SRC1[191]) Load_32(mem + 20) ELSE 0
DEST[223:192] ← IF (SRC1[223]) Load_32(mem + 24) ELSE 0
DEST[255:224] ← IF (SRC1[255]) Load_32(mem + 28) ELSE 0


VMASKMOVPD - 128-bit load
DEST[63:0] ← IF (SRC1[63]) Load_64(mem) ELSE 0
DEST[127:64] ← IF (SRC1[127]) Load_64(mem + 8) ELSE 0
DEST[MAXVL-1:128] ← 0


VMASKMOVPD - 256-bit load
DEST[63:0] ← IF (SRC1[63]) Load_64(mem) ELSE 0
DEST[127:64] ← IF (SRC1[127]) Load_64(mem + 8) ELSE 0
DEST[191:128] ← IF (SRC1[191]) Load_64(mem + 16) ELSE 0
DEST[255:192] ← IF (SRC1[255]) Load_64(mem + 24) ELSE 0


VMASKMOVPS - 128-bit store
IF (SRC1[31]) DEST[31:0] ← SRC2[31:0]
IF (SRC1[63]) DEST[63:32] ← SRC2[63:32]
IF (SRC1[95]) DEST[95:64] ← SRC2[95:64]
IF (SRC1[127]) DEST[127:96] ← SRC2[127:96]


VMASKMOVPS - 256-bit store
IF (SRC1[31]) DEST[31:0] ← SRC2[31:0]
IF (SRC1[63]) DEST[63:32] ← SRC2[63:32]
IF (SRC1[95]) DEST[95:64] ← SRC2[95:64]
IF (SRC1[127]) DEST[127:96] ← SRC2[127:96]
IF (SRC1[159]) DEST[159:128] ← SRC2[159:128]
IF (SRC1[191]) DEST[191:160] ← SRC2[191:160]
IF (SRC1[223]) DEST[223:192] ← SRC2[223:192]
IF (SRC1[255]) DEST[255:224] ← SRC2[255:224]



VMASKMOVPD - 128-bit store
IF (SRC1[63]) DEST[63:0] ← SRC2[63:0]
IF (SRC1[127]) DEST[127:64] ← SRC2[127:64]


VMASKMOVPD - 256-bit store
IF (SRC1[63]) DEST[63:0] ← SRC2[63:0]
IF (SRC1[127]) DEST[127:64] ← SRC2[127:64]
IF (SRC1[191]) DEST[191:128] ← SRC2[191:128]
IF (SRC1[255]) DEST[255:192] ← SRC2[255:192]


Intel C/C Compiler Intrinsic Equivalent

__m256 _mm256_maskload_ps(float const *a, __m256i mask);
void _mm256_maskstore_ps(float *a, __m256i mask, __m256 b);
__m256d _mm256_maskload_pd(double const *a, __m256i mask);
void _mm256_maskstore_pd(double *a, __m256i mask, __m256d b);
__m128 _mm_maskload_ps(float const *a, __m128i mask);
void _mm_maskstore_ps(float *a, __m128i mask, __m128 b);
__m128d _mm_maskload_pd(double const *a, __m128i mask);
void _mm_maskstore_pd(double *a, __m128i mask, __m128d b);

SIMD Floating-Point Exceptions

None


Other Exceptions

See Exceptions Type 6 (No AC# reported for any mask bit combinations); additionally

#UD If VEX.W = 1.


VPBLENDD — Blend Packed Dwords

Opcode/Instruction | Op/En | 64/32-bit Mode | CPUID Feature Flag | Description
VEX.NDS.128.66.0F3A.W0 02 /r ib VPBLENDD xmm1, xmm2, xmm3/m128, imm8 | RVMI | V/V | AVX2 | Select dwords from xmm2 and xmm3/m128 from mask specified in imm8 and store the values into xmm1.
VEX.NDS.256.66.0F3A.W0 02 /r ib VPBLENDD ymm1, ymm2, ymm3/m256, imm8 | RVMI | V/V | AVX2 | Select dwords from ymm2 and ymm3/m256 from mask specified in imm8 and store the values into ymm1.


Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RVMI | ModRM:reg (w) | VEX.vvvv | ModRM:r/m (r) | Imm8


Description

Dword elements from the source operand (second operand) are conditionally written to the destination operand (first operand) depending on bits in the immediate operand (third operand). The immediate bits (bits 7:0) form a mask that determines whether the corresponding dword in the destination is copied from the source. If a bit in the mask, corresponding to a dword, is “1”, then the dword is copied, else the dword is unchanged.

VEX.128 encoded version: The second source operand can be an XMM register or a 128-bit memory location. The first source and destination operands are XMM registers. Bits (MAXVL-1:128) of the corresponding YMM register are zeroed.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register or a 256-bit memory location. The destination operand is a YMM register.
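The immediate-driven selection can be modeled in plain C for the VEX.128 form. This is an illustrative sketch (function name and dword-array operands are invented), mirroring the operation pseudocode below.

```c
#include <stdint.h>

/* Model of VPBLENDD (VEX.128 form): bit i of imm8 selects dword i from
   the second source (bit = 1) or from the first source (bit = 0). */
static void vpblendd128_model(uint32_t dst[4], const uint32_t src1[4],
                              const uint32_t src2[4], int imm8)
{
    for (int i = 0; i < 4; i++)
        dst[i] = ((imm8 >> i) & 1) ? src2[i] : src1[i];
}
```

Note the contrast with VPBLENDM below: here the selector is a compile-time immediate, not an opmask register.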


Operation

VPBLENDD (VEX.256 encoded version)

IF (imm8[0] == 1) THEN DEST[31:0] ← SRC2[31:0] ELSE DEST[31:0] ← SRC1[31:0]
IF (imm8[1] == 1) THEN DEST[63:32] ← SRC2[63:32] ELSE DEST[63:32] ← SRC1[63:32]
IF (imm8[2] == 1) THEN DEST[95:64] ← SRC2[95:64] ELSE DEST[95:64] ← SRC1[95:64]
IF (imm8[3] == 1) THEN DEST[127:96] ← SRC2[127:96] ELSE DEST[127:96] ← SRC1[127:96]
IF (imm8[4] == 1) THEN DEST[159:128] ← SRC2[159:128] ELSE DEST[159:128] ← SRC1[159:128]
IF (imm8[5] == 1) THEN DEST[191:160] ← SRC2[191:160] ELSE DEST[191:160] ← SRC1[191:160]
IF (imm8[6] == 1) THEN DEST[223:192] ← SRC2[223:192] ELSE DEST[223:192] ← SRC1[223:192]
IF (imm8[7] == 1) THEN DEST[255:224] ← SRC2[255:224] ELSE DEST[255:224] ← SRC1[255:224]



VPBLENDD (VEX.128 encoded version)

IF (imm8[0] == 1) THEN DEST[31:0] ← SRC2[31:0] ELSE DEST[31:0] ← SRC1[31:0]
IF (imm8[1] == 1) THEN DEST[63:32] ← SRC2[63:32] ELSE DEST[63:32] ← SRC1[63:32]
IF (imm8[2] == 1) THEN DEST[95:64] ← SRC2[95:64] ELSE DEST[95:64] ← SRC1[95:64]
IF (imm8[3] == 1) THEN DEST[127:96] ← SRC2[127:96] ELSE DEST[127:96] ← SRC1[127:96]
DEST[MAXVL-1:128] ← 0


Intel C/C++ Compiler Intrinsic Equivalent

VPBLENDD:
__m128i _mm_blend_epi32 (__m128i v1, __m128i v2, const int mask)
__m256i _mm256_blend_epi32 (__m256i v1, __m256i v2, const int mask)

SIMD Floating-Point Exceptions

None


Other Exceptions

See Exceptions Type 4; additionally

#UD If VEX.W = 1.


VPBLENDMB/VPBLENDMW—Blend Byte/Word Vectors Using an Opmask Control

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.NDS.128.66.0F38.W0 66 /r VPBLENDMB xmm1 {k1}{z}, xmm2, xmm3/m128 | A | V/V | AVX512VL AVX512BW | Blend byte integer vector xmm2 and byte vector xmm3/m128 and store the result in xmm1, under control mask.
EVEX.NDS.256.66.0F38.W0 66 /r VPBLENDMB ymm1 {k1}{z}, ymm2, ymm3/m256 | A | V/V | AVX512VL AVX512BW | Blend byte integer vector ymm2 and byte vector ymm3/m256 and store the result in ymm1, under control mask.
EVEX.NDS.512.66.0F38.W0 66 /r VPBLENDMB zmm1 {k1}{z}, zmm2, zmm3/m512 | A | V/V | AVX512BW | Blend byte integer vector zmm2 and byte vector zmm3/m512 and store the result in zmm1, under control mask.
EVEX.NDS.128.66.0F38.W1 66 /r VPBLENDMW xmm1 {k1}{z}, xmm2, xmm3/m128 | A | V/V | AVX512VL AVX512BW | Blend word integer vector xmm2 and word vector xmm3/m128 and store the result in xmm1, under control mask.
EVEX.NDS.256.66.0F38.W1 66 /r VPBLENDMW ymm1 {k1}{z}, ymm2, ymm3/m256 | A | V/V | AVX512VL AVX512BW | Blend word integer vector ymm2 and word vector ymm3/m256 and store the result in ymm1, under control mask.
EVEX.NDS.512.66.0F38.W1 66 /r VPBLENDMW zmm1 {k1}{z}, zmm2, zmm3/m512 | A | V/V | AVX512BW | Blend word integer vector zmm2 and word vector zmm3/m512 and store the result in zmm1, under control mask.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | Full Mem | ModRM:reg (w) | EVEX.vvvv | ModRM:r/m (r) | NA

Description

Performs an element-by-element blending of byte/word elements between the first source operand byte vector register and the second source operand byte vector from memory or register, using the instruction mask as selector. The result is written into the destination byte vector register.

The destination and first source operands are ZMM/YMM/XMM registers. The second source operand can be a ZMM/YMM/XMM register or a 512/256/128-bit memory location.

The mask is not used as a writemask for this instruction. Instead, the mask is used as an element selector: every element of the destination is conditionally selected between first source or second source using the value of the related mask bit (0 for first source, 1 for second source).
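The opmask-as-selector behavior, including the zeroing variant, can be modeled in plain C for the low 16 bytes of the byte form. This is an illustrative sketch only; the function name and array operands are invented.

```c
#include <stdint.h>

/* Model of VPBLENDMB over 16 bytes: mask bit j selects src2 (bit = 1) or
   src1 (bit = 0, merging-masking). With EVEX.z set (zeroing != 0),
   elements whose mask bit is 0 are zeroed instead of taking src1. */
static void vpblendmb_model(uint8_t dst[16], uint16_t k,
                            const uint8_t src1[16], const uint8_t src2[16],
                            int zeroing)
{
    for (int j = 0; j < 16; j++)
        dst[j] = ((k >> j) & 1) ? src2[j] : (zeroing ? 0 : src1[j]);
}
```

This makes the difference from a writemask explicit: a 0 bit never leaves the destination element untouched; it selects the first source (or zero), exactly as the operation pseudocode below shows.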



Operation

VPBLENDMB (EVEX encoded versions)
(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j ← 0 TO KL-1
    i ← j * 8
    IF k1[j] OR *no writemask*
        THEN DEST[i+7:i] ← SRC2[i+7:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN DEST[i+7:i] ← SRC1[i+7:i]
                ELSE ; zeroing-masking
                    DEST[i+7:i] ← 0
            FI;
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0;


VPBLENDMW (EVEX encoded versions)
(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j ← 0 TO KL-1
    i ← j * 16
    IF k1[j] OR *no writemask*
        THEN DEST[i+15:i] ← SRC2[i+15:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN DEST[i+15:i] ← SRC1[i+15:i]
                ELSE ; zeroing-masking
                    DEST[i+15:i] ← 0
            FI;
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


Intel C/C++ Compiler Intrinsic Equivalent

VPBLENDMB __m512i _mm512_mask_blend_epi8(__mmask64 m, __m512i a, __m512i b);
VPBLENDMB __m256i _mm256_mask_blend_epi8(__mmask32 m, __m256i a, __m256i b);
VPBLENDMB __m128i _mm_mask_blend_epi8(__mmask16 m, __m128i a, __m128i b);
VPBLENDMW __m512i _mm512_mask_blend_epi16(__mmask32 m, __m512i a, __m512i b);
VPBLENDMW __m256i _mm256_mask_blend_epi16(__mmask16 m, __m256i a, __m256i b);
VPBLENDMW __m128i _mm_mask_blend_epi16(__mmask8 m, __m128i a, __m128i b);


SIMD Floating-Point Exceptions

None


Other Exceptions

See Exceptions Type E4.


VPBLENDMD/VPBLENDMQ—Blend Int32/Int64 Vectors Using an OpMask Control

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.NDS.128.66.0F38.W0 64 /r VPBLENDMD xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | A | V/V | AVX512VL AVX512F | Blend doubleword integer vector xmm2 and doubleword vector xmm3/m128/m32bcst and store the result in xmm1, under control mask.
EVEX.NDS.256.66.0F38.W0 64 /r VPBLENDMD ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | A | V/V | AVX512VL AVX512F | Blend doubleword integer vector ymm2 and doubleword vector ymm3/m256/m32bcst and store the result in ymm1, under control mask.
EVEX.NDS.512.66.0F38.W0 64 /r VPBLENDMD zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst | A | V/V | AVX512F | Blend doubleword integer vector zmm2 and doubleword vector zmm3/m512/m32bcst and store the result in zmm1, under control mask.
EVEX.NDS.128.66.0F38.W1 64 /r VPBLENDMQ xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | A | V/V | AVX512VL AVX512F | Blend quadword integer vector xmm2 and quadword vector xmm3/m128/m64bcst and store the result in xmm1, under control mask.
EVEX.NDS.256.66.0F38.W1 64 /r VPBLENDMQ ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | A | V/V | AVX512VL AVX512F | Blend quadword integer vector ymm2 and quadword vector ymm3/m256/m64bcst and store the result in ymm1, under control mask.
EVEX.NDS.512.66.0F38.W1 64 /r VPBLENDMQ zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst | A | V/V | AVX512F | Blend quadword integer vector zmm2 and quadword vector zmm3/m512/m64bcst and store the result in zmm1, under control mask.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | Full | ModRM:reg (w) | EVEX.vvvv | ModRM:r/m (r) | NA

Description

Performs an element-by-element blending of dword/qword elements between the first source operand (the second operand) and the elements of the second source operand (the third operand) using an opmask register as select control. The blended result is written into the destination.

The destination and first source operands are ZMM/YMM/XMM registers. The second source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcast from a 32-bit (VPBLENDMD) or 64-bit (VPBLENDMQ) memory location.

The opmask register is not used as a writemask for this instruction. Instead, the mask is used as an element selector: every element of the destination is conditionally selected between first source or second source using the value of the related mask bit (0 for the first source operand, 1 for the second source operand).

If EVEX.z is set, the elements with corresponding mask bit value of 0 in the destination operand are zeroed.



Operation

VPBLENDMD (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no controlmask* THEN
        IF (EVEX.b = 1) AND (SRC2 *is memory*) THEN
            DEST[i+31:i] ← SRC2[31:0]
        ELSE
            DEST[i+31:i] ← SRC2[i+31:i]
        FI;
    ELSE
        IF *merging-masking* ; merging-masking
            THEN DEST[i+31:i] ← SRC1[i+31:i]
            ELSE ; zeroing-masking
                DEST[i+31:i] ← 0
        FI;
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0;


VPBLENDMQ (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no controlmask* THEN
        IF (EVEX.b = 1) AND (SRC2 *is memory*) THEN
            DEST[i+63:i] ← SRC2[63:0]
        ELSE
            DEST[i+63:i] ← SRC2[i+63:i]
        FI;
    ELSE
        IF *merging-masking* ; merging-masking
            THEN DEST[i+63:i] ← SRC1[i+63:i]
            ELSE ; zeroing-masking
                DEST[i+63:i] ← 0
        FI;
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0



Intel C/C++ Compiler Intrinsic Equivalent

VPBLENDMD __m512i _mm512_mask_blend_epi32(__mmask16 k, __m512i a, __m512i b);
VPBLENDMD __m256i _mm256_mask_blend_epi32(__mmask8 m, __m256i a, __m256i b);
VPBLENDMD __m128i _mm_mask_blend_epi32(__mmask8 m, __m128i a, __m128i b);
VPBLENDMQ __m512i _mm512_mask_blend_epi64(__mmask8 k, __m512i a, __m512i b);
VPBLENDMQ __m256i _mm256_mask_blend_epi64(__mmask8 m, __m256i a, __m256i b);
VPBLENDMQ __m128i _mm_mask_blend_epi64(__mmask8 m, __m128i a, __m128i b);


SIMD Floating-Point Exceptions

None


Other Exceptions

See Exceptions Type E4.


VPBROADCASTB/W/D/Q—Load with Broadcast Integer Data from General Purpose Register

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F38.W0 7A /r VPBROADCASTB xmm1 {k1}{z}, reg | A | V/V | AVX512VL AVX512BW | Broadcast an 8-bit value from a GPR to all bytes in the 128-bit destination subject to writemask k1.
EVEX.256.66.0F38.W0 7A /r VPBROADCASTB ymm1 {k1}{z}, reg | A | V/V | AVX512VL AVX512BW | Broadcast an 8-bit value from a GPR to all bytes in the 256-bit destination subject to writemask k1.
EVEX.512.66.0F38.W0 7A /r VPBROADCASTB zmm1 {k1}{z}, reg | A | V/V | AVX512BW | Broadcast an 8-bit value from a GPR to all bytes in the 512-bit destination subject to writemask k1.
EVEX.128.66.0F38.W0 7B /r VPBROADCASTW xmm1 {k1}{z}, reg | A | V/V | AVX512VL AVX512BW | Broadcast a 16-bit value from a GPR to all words in the 128-bit destination subject to writemask k1.
EVEX.256.66.0F38.W0 7B /r VPBROADCASTW ymm1 {k1}{z}, reg | A | V/V | AVX512VL AVX512BW | Broadcast a 16-bit value from a GPR to all words in the 256-bit destination subject to writemask k1.
EVEX.512.66.0F38.W0 7B /r VPBROADCASTW zmm1 {k1}{z}, reg | A | V/V | AVX512BW | Broadcast a 16-bit value from a GPR to all words in the 512-bit destination subject to writemask k1.
EVEX.128.66.0F38.W0 7C /r VPBROADCASTD xmm1 {k1}{z}, r32 | A | V/V | AVX512VL AVX512F | Broadcast a 32-bit value from a GPR to all double-words in the 128-bit destination subject to writemask k1.
EVEX.256.66.0F38.W0 7C /r VPBROADCASTD ymm1 {k1}{z}, r32 | A | V/V | AVX512VL AVX512F | Broadcast a 32-bit value from a GPR to all double-words in the 256-bit destination subject to writemask k1.
EVEX.512.66.0F38.W0 7C /r VPBROADCASTD zmm1 {k1}{z}, r32 | A | V/V | AVX512F | Broadcast a 32-bit value from a GPR to all double-words in the 512-bit destination subject to writemask k1.
EVEX.128.66.0F38.W1 7C /r VPBROADCASTQ xmm1 {k1}{z}, r64 | A | V/N.E.1 | AVX512VL AVX512F | Broadcast a 64-bit value from a GPR to all quad-words in the 128-bit destination subject to writemask k1.
EVEX.256.66.0F38.W1 7C /r VPBROADCASTQ ymm1 {k1}{z}, r64 | A | V/N.E.1 | AVX512VL AVX512F | Broadcast a 64-bit value from a GPR to all quad-words in the 256-bit destination subject to writemask k1.
EVEX.512.66.0F38.W1 7C /r VPBROADCASTQ zmm1 {k1}{z}, r64 | A | V/N.E.1 | AVX512F | Broadcast a 64-bit value from a GPR to all quad-words in the 512-bit destination subject to writemask k1.

NOTES:

  1. EVEX.W in non-64-bit mode is ignored; the instruction behaves as if the W0 version is used.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | Tuple1 Scalar | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Description

Broadcasts an 8-bit, 16-bit, 32-bit or 64-bit value from a general-purpose register (the second operand) to all the locations in the destination vector register (the first operand) using the writemask k1.

EVEX.vvvv is reserved and must be 1111b; otherwise the instruction will #UD.
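The broadcast-under-writemask behavior can be modeled in plain C for the 512-bit dword form. This is an illustrative sketch (the function name and array destination are invented), covering both merging- and zeroing-masking.

```c
#include <stdint.h>

/* Model of EVEX VPBROADCASTD zmm1 {k1}{z}, r32 over 16 dwords: each
   element whose writemask bit is set receives the GPR value; the others
   keep their old value (merging-masking) or are zeroed (zeroing-masking). */
static void vpbroadcastd_model(uint32_t dst[16], uint16_t k1, uint32_t src,
                               int zeroing)
{
    for (int j = 0; j < 16; j++) {
        if ((k1 >> j) & 1)
            dst[j] = src;      /* DEST[i+31:i] <- SRC[31:0] */
        else if (zeroing)
            dst[j] = 0;        /* zeroing-masking           */
        /* else: merging-masking, element unchanged          */
    }
}
```

With k1 = 0xFFFF (or no writemask) every dword receives the value, which is the common set1-style use.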



Operation

VPBROADCASTB (EVEX encoded versions)
(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j ← 0 TO KL-1
    i ← j * 8
    IF k1[j] OR *no writemask*
        THEN DEST[i+7:i] ← SRC[7:0]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+7:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+7:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


VPBROADCASTW (EVEX encoded versions)
(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j ← 0 TO KL-1
    i ← j * 16
    IF k1[j] OR *no writemask*
        THEN DEST[i+15:i] ← SRC[15:0]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+15:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+15:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


VPBROADCASTD (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[31:0]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0



VPBROADCASTQ (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← SRC[63:0]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


Intel C/C++ Compiler Intrinsic Equivalent

VPBROADCASTB __m512i _mm512_mask_set1_epi8(__m512i s, __mmask64 k, int a);
VPBROADCASTB __m512i _mm512_maskz_set1_epi8(__mmask64 k, int a);
VPBROADCASTB __m256i _mm256_mask_set1_epi8(__m256i s, __mmask32 k, int a);
VPBROADCASTB __m256i _mm256_maskz_set1_epi8(__mmask32 k, int a);
VPBROADCASTB __m128i _mm_mask_set1_epi8(__m128i s, __mmask16 k, int a);
VPBROADCASTB __m128i _mm_maskz_set1_epi8(__mmask16 k, int a);
VPBROADCASTD __m512i _mm512_mask_set1_epi32(__m512i s, __mmask16 k, int a);
VPBROADCASTD __m512i _mm512_maskz_set1_epi32(__mmask16 k, int a);
VPBROADCASTD __m256i _mm256_mask_set1_epi32(__m256i s, __mmask8 k, int a);
VPBROADCASTD __m256i _mm256_maskz_set1_epi32(__mmask8 k, int a);
VPBROADCASTD __m128i _mm_mask_set1_epi32(__m128i s, __mmask8 k, int a);
VPBROADCASTD __m128i _mm_maskz_set1_epi32(__mmask8 k, int a);
VPBROADCASTQ __m512i _mm512_mask_set1_epi64(__m512i s, __mmask8 k, __int64 a);
VPBROADCASTQ __m512i _mm512_maskz_set1_epi64(__mmask8 k, __int64 a);
VPBROADCASTQ __m256i _mm256_mask_set1_epi64(__m256i s, __mmask8 k, __int64 a);
VPBROADCASTQ __m256i _mm256_maskz_set1_epi64(__mmask8 k, __int64 a);
VPBROADCASTQ __m128i _mm_mask_set1_epi64(__m128i s, __mmask8 k, __int64 a);
VPBROADCASTQ __m128i _mm_maskz_set1_epi64(__mmask8 k, __int64 a);
VPBROADCASTW __m512i _mm512_mask_set1_epi16(__m512i s, __mmask32 k, int a);
VPBROADCASTW __m512i _mm512_maskz_set1_epi16(__mmask32 k, int a);
VPBROADCASTW __m256i _mm256_mask_set1_epi16(__m256i s, __mmask16 k, int a);
VPBROADCASTW __m256i _mm256_maskz_set1_epi16(__mmask16 k, int a);
VPBROADCASTW __m128i _mm_mask_set1_epi16(__m128i s, __mmask8 k, int a);
VPBROADCASTW __m128i _mm_maskz_set1_epi16(__mmask8 k, int a);


Exceptions

EVEX-encoded instructions, see Exceptions Type E7NM.

#UD If EVEX.vvvv != 1111B.


VPBROADCAST—Load Integer and Broadcast

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
VEX.128.66.0F38.W0 78 /r VPBROADCASTB xmm1, xmm2/m8 | A | V/V | AVX2 | Broadcast a byte integer in the source operand to sixteen locations in xmm1.
VEX.256.66.0F38.W0 78 /r VPBROADCASTB ymm1, xmm2/m8 | A | V/V | AVX2 | Broadcast a byte integer in the source operand to thirty-two locations in ymm1.
EVEX.128.66.0F38.W0 78 /r VPBROADCASTB xmm1{k1}{z}, xmm2/m8 | B | V/V | AVX512VL AVX512BW | Broadcast a byte integer in the source operand to locations in xmm1 subject to writemask k1.
EVEX.256.66.0F38.W0 78 /r VPBROADCASTB ymm1{k1}{z}, xmm2/m8 | B | V/V | AVX512VL AVX512BW | Broadcast a byte integer in the source operand to locations in ymm1 subject to writemask k1.
EVEX.512.66.0F38.W0 78 /r VPBROADCASTB zmm1{k1}{z}, xmm2/m8 | B | V/V | AVX512BW | Broadcast a byte integer in the source operand to 64 locations in zmm1 subject to writemask k1.
VEX.128.66.0F38.W0 79 /r VPBROADCASTW xmm1, xmm2/m16 | A | V/V | AVX2 | Broadcast a word integer in the source operand to eight locations in xmm1.
VEX.256.66.0F38.W0 79 /r VPBROADCASTW ymm1, xmm2/m16 | A | V/V | AVX2 | Broadcast a word integer in the source operand to sixteen locations in ymm1.
EVEX.128.66.0F38.W0 79 /r VPBROADCASTW xmm1{k1}{z}, xmm2/m16 | B | V/V | AVX512VL AVX512BW | Broadcast a word integer in the source operand to locations in xmm1 subject to writemask k1.
EVEX.256.66.0F38.W0 79 /r VPBROADCASTW ymm1{k1}{z}, xmm2/m16 | B | V/V | AVX512VL AVX512BW | Broadcast a word integer in the source operand to locations in ymm1 subject to writemask k1.
EVEX.512.66.0F38.W0 79 /r VPBROADCASTW zmm1{k1}{z}, xmm2/m16 | B | V/V | AVX512BW | Broadcast a word integer in the source operand to 32 locations in zmm1 subject to writemask k1.
VEX.128.66.0F38.W0 58 /r VPBROADCASTD xmm1, xmm2/m32 | A | V/V | AVX2 | Broadcast a dword integer in the source operand to four locations in xmm1.
VEX.256.66.0F38.W0 58 /r VPBROADCASTD ymm1, xmm2/m32 | A | V/V | AVX2 | Broadcast a dword integer in the source operand to eight locations in ymm1.
EVEX.128.66.0F38.W0 58 /r VPBROADCASTD xmm1 {k1}{z}, xmm2/m32 | B | V/V | AVX512VL AVX512F | Broadcast a dword integer in the source operand to locations in xmm1 subject to writemask k1.
EVEX.256.66.0F38.W0 58 /r VPBROADCASTD ymm1 {k1}{z}, xmm2/m32 | B | V/V | AVX512VL AVX512F | Broadcast a dword integer in the source operand to locations in ymm1 subject to writemask k1.
EVEX.512.66.0F38.W0 58 /r VPBROADCASTD zmm1 {k1}{z}, xmm2/m32 | B | V/V | AVX512F | Broadcast a dword integer in the source operand to locations in zmm1 subject to writemask k1.
VEX.128.66.0F38.W0 59 /r VPBROADCASTQ xmm1, xmm2/m64 | A | V/V | AVX2 | Broadcast a qword element in source operand to two locations in xmm1.
VEX.256.66.0F38.W0 59 /r VPBROADCASTQ ymm1, xmm2/m64 | A | V/V | AVX2 | Broadcast a qword element in source operand to four locations in ymm1.
EVEX.128.66.0F38.W1 59 /r VPBROADCASTQ xmm1 {k1}{z}, xmm2/m64 | B | V/V | AVX512VL AVX512F | Broadcast a qword element in source operand to locations in xmm1 subject to writemask k1.
EVEX.256.66.0F38.W1 59 /r VPBROADCASTQ ymm1 {k1}{z}, xmm2/m64 | B | V/V | AVX512VL AVX512F | Broadcast a qword element in source operand to locations in ymm1 subject to writemask k1.
EVEX.512.66.0F38.W1 59 /r VPBROADCASTQ zmm1 {k1}{z}, xmm2/m64 | B | V/V | AVX512F | Broadcast a qword element in source operand to locations in zmm1 subject to writemask k1.
EVEX.128.66.0F38.W0 59 /r VBROADCASTI32x2 xmm1 {k1}{z}, xmm2/m64 | C | V/V | AVX512VL AVX512DQ | Broadcast two dword elements in source operand to locations in xmm1 subject to writemask k1.



EVEX.256.66.0F38.W0 59 /r VBROADCASTI32x2 ymm1 {k1}{z}, xmm2/m64 | C | V/V | AVX512VL AVX512DQ | Broadcast two dword elements in source operand to locations in ymm1 subject to writemask k1.
EVEX.512.66.0F38.W0 59 /r VBROADCASTI32x2 zmm1 {k1}{z}, xmm2/m64 | C | V/V | AVX512DQ | Broadcast two dword elements in source operand to locations in zmm1 subject to writemask k1.
VEX.256.66.0F38.W0 5A /r VBROADCASTI128 ymm1, m128 | A | V/V | AVX2 | Broadcast 128 bits of integer data in mem to low and high 128-bits in ymm1.
EVEX.256.66.0F38.W0 5A /r VBROADCASTI32X4 ymm1 {k1}{z}, m128 | D | V/V | AVX512VL AVX512F | Broadcast 128 bits of 4 doubleword integer data in mem to locations in ymm1 using writemask k1.
EVEX.512.66.0F38.W0 5A /r VBROADCASTI32X4 zmm1 {k1}{z}, m128 | D | V/V | AVX512F | Broadcast 128 bits of 4 doubleword integer data in mem to locations in zmm1 using writemask k1.
EVEX.256.66.0F38.W1 5A /r VBROADCASTI64X2 ymm1 {k1}{z}, m128 | C | V/V | AVX512VL AVX512DQ | Broadcast 128 bits of 2 quadword integer data in mem to locations in ymm1 using writemask k1.
EVEX.512.66.0F38.W1 5A /r VBROADCASTI64X2 zmm1 {k1}{z}, m128 | C | V/V | AVX512DQ | Broadcast 128 bits of 2 quadword integer data in mem to locations in zmm1 using writemask k1.
EVEX.512.66.0F38.W0 5B /r VBROADCASTI32X8 zmm1 {k1}{z}, m256 | E | V/V | AVX512DQ | Broadcast 256 bits of 8 doubleword integer data in mem to locations in zmm1 using writemask k1.
EVEX.512.66.0F38.W1 5B /r VBROADCASTI64X4 zmm1 {k1}{z}, m256 | D | V/V | AVX512F | Broadcast 256 bits of 4 quadword integer data in mem to locations in zmm1 using writemask k1.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | NA | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
B | Tuple1 Scalar | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
C | Tuple2 | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
D | Tuple4 | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
E | Tuple8 | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Description

Load integer data from the source operand (the second operand) and broadcast to all elements of the destination operand (the first operand).

VEX.256-encoded VPBROADCASTB/W/D/Q: The source operand is an 8-bit, 16-bit, 32-bit, or 64-bit memory location or the low 8-bit, 16-bit, 32-bit, or 64-bit data in an XMM register. The destination operand is a YMM register.

VBROADCASTI128 supports only a 128-bit memory location as the source operand. Register source encodings for VBROADCASTI128 are reserved and will #UD. Bits (MAXVL-1:256) of the destination register are zeroed.

EVEX-encoded VPBROADCASTD/Q: The source operand is a 32-bit or 64-bit memory location or the low 32-bit or 64-bit data in an XMM register. The destination operand is a ZMM/YMM/XMM register updated according to the writemask k1.

VBROADCASTI32X4 and VBROADCASTI64X4: The destination operand is a ZMM register updated according to the writemask k1. The source operand is a 128-bit or 256-bit memory location, respectively. Register source encodings for VBROADCASTI32X4 and VBROADCASTI64X4 are reserved and will #UD.



Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD.

An attempt to execute VBROADCASTI128 encoded with VEX.L = 0 will cause an #UD exception.



Figure 5-16. VPBROADCASTD Operation (VEX.256 encoded version)


Figure 5-17. VPBROADCASTD Operation (128-bit version)


Figure 5-18. VPBROADCASTQ Operation (256-bit version)


Figure 5-19. VBROADCASTI128 Operation (256-bit version)


Figure 5-20. VBROADCASTI64X4 Operation (512-bit version)



Operation

VPBROADCASTB (EVEX encoded versions)
(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j ← 0 TO KL-1
    i ← j * 8
    IF k1[j] OR *no writemask*
        THEN DEST[i+7:i] ← SRC[7:0]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+7:i] remains unchanged*
                ELSE DEST[i+7:i] ← 0 ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0



VPBROADCASTW (EVEX encoded versions)
(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j ← 0 TO KL-1
    i ← j * 16
    IF k1[j] OR *no writemask*
        THEN DEST[i+15:i] ← SRC[15:0]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+15:i] remains unchanged*
                ELSE DEST[i+15:i] ← 0 ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


VPBROADCASTD (128 bit version)
temp ← SRC[31:0]
DEST[31:0] ← temp
DEST[63:32] ← temp
DEST[95:64] ← temp
DEST[127:96] ← temp
DEST[MAXVL-1:128] ← 0


VPBROADCASTD (VEX.256 encoded version)
temp ← SRC[31:0]
DEST[31:0] ← temp
DEST[63:32] ← temp
DEST[95:64] ← temp
DEST[127:96] ← temp
DEST[159:128] ← temp
DEST[191:160] ← temp
DEST[223:192] ← temp
DEST[255:224] ← temp
DEST[MAXVL-1:256] ← 0


VPBROADCASTD (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[31:0]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE DEST[i+31:i] ← 0 ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0



VPBROADCASTQ (VEX.256 encoded version)
temp ← SRC[63:0]
DEST[63:0] ← temp
DEST[127:64] ← temp
DEST[191:128] ← temp
DEST[255:192] ← temp
DEST[MAXVL-1:256] ← 0


VPBROADCASTQ (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← SRC[63:0]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0 ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0

VBROADCASTI32x2 (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    n ← (j mod 2) * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[n+31:n]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE DEST[i+31:i] ← 0 ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


VBROADCASTI128 (VEX.256 encoded version)
temp ← SRC[127:0]
DEST[127:0] ← temp
DEST[255:128] ← temp
DEST[MAXVL-1:256] ← 0



VBROADCASTI32X4 (EVEX encoded versions)
(KL, VL) = (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    n ← (j modulo 4) * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[n+31:n]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE DEST[i+31:i] ← 0 ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


VBROADCASTI64X2 (EVEX encoded versions)
(KL, VL) = (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    n ← (j modulo 2) * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← SRC[n+63:n]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0 ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


VBROADCASTI32X8 (EVEX.U1.512 encoded version)
FOR j ← 0 TO 15
    i ← j * 32
    n ← (j modulo 8) * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[n+31:n]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE DEST[i+31:i] ← 0 ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0



VBROADCASTI64X4 (EVEX.512 encoded version)
FOR j ← 0 TO 7
    i ← j * 64
    n ← (j modulo 4) * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← SRC[n+63:n]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0 ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


Intel C/C++ Compiler Intrinsic Equivalent

VPBROADCASTB __m512i _mm512_broadcastb_epi8( __m128i a);
VPBROADCASTB __m512i _mm512_mask_broadcastb_epi8( __m512i s, __mmask64 k, __m128i a);
VPBROADCASTB __m512i _mm512_maskz_broadcastb_epi8( __mmask64 k, __m128i a);
VPBROADCASTB __m256i _mm256_broadcastb_epi8( __m128i a);
VPBROADCASTB __m256i _mm256_mask_broadcastb_epi8( __m256i s, __mmask32 k, __m128i a);
VPBROADCASTB __m256i _mm256_maskz_broadcastb_epi8( __mmask32 k, __m128i a);
VPBROADCASTB __m128i _mm_mask_broadcastb_epi8( __m128i s, __mmask16 k, __m128i a);
VPBROADCASTB __m128i _mm_maskz_broadcastb_epi8( __mmask16 k, __m128i a);
VPBROADCASTB __m128i _mm_broadcastb_epi8( __m128i a);
VPBROADCASTD __m512i _mm512_broadcastd_epi32( __m128i a);
VPBROADCASTD __m512i _mm512_mask_broadcastd_epi32( __m512i s, __mmask16 k, __m128i a);
VPBROADCASTD __m512i _mm512_maskz_broadcastd_epi32( __mmask16 k, __m128i a);
VPBROADCASTD __m256i _mm256_broadcastd_epi32( __m128i a);
VPBROADCASTD __m256i _mm256_mask_broadcastd_epi32( __m256i s, __mmask8 k, __m128i a);
VPBROADCASTD __m256i _mm256_maskz_broadcastd_epi32( __mmask8 k, __m128i a);
VPBROADCASTD __m128i _mm_broadcastd_epi32( __m128i a);
VPBROADCASTD __m128i _mm_mask_broadcastd_epi32( __m128i s, __mmask8 k, __m128i a);
VPBROADCASTD __m128i _mm_maskz_broadcastd_epi32( __mmask8 k, __m128i a);
VPBROADCASTQ __m512i _mm512_broadcastq_epi64( __m128i a);
VPBROADCASTQ __m512i _mm512_mask_broadcastq_epi64( __m512i s, __mmask8 k, __m128i a);
VPBROADCASTQ __m512i _mm512_maskz_broadcastq_epi64( __mmask8 k, __m128i a);
VPBROADCASTQ __m256i _mm256_broadcastq_epi64( __m128i a);
VPBROADCASTQ __m256i _mm256_mask_broadcastq_epi64( __m256i s, __mmask8 k, __m128i a);
VPBROADCASTQ __m256i _mm256_maskz_broadcastq_epi64( __mmask8 k, __m128i a);
VPBROADCASTQ __m128i _mm_broadcastq_epi64( __m128i a);
VPBROADCASTQ __m128i _mm_mask_broadcastq_epi64( __m128i s, __mmask8 k, __m128i a);
VPBROADCASTQ __m128i _mm_maskz_broadcastq_epi64( __mmask8 k, __m128i a);
VPBROADCASTW __m512i _mm512_broadcastw_epi16( __m128i a);
VPBROADCASTW __m512i _mm512_mask_broadcastw_epi16( __m512i s, __mmask32 k, __m128i a);
VPBROADCASTW __m512i _mm512_maskz_broadcastw_epi16( __mmask32 k, __m128i a);
VPBROADCASTW __m256i _mm256_broadcastw_epi16( __m128i a);
VPBROADCASTW __m256i _mm256_mask_broadcastw_epi16( __m256i s, __mmask16 k, __m128i a);
VPBROADCASTW __m256i _mm256_maskz_broadcastw_epi16( __mmask16 k, __m128i a);
VPBROADCASTW __m128i _mm_broadcastw_epi16( __m128i a);
VPBROADCASTW __m128i _mm_mask_broadcastw_epi16( __m128i s, __mmask8 k, __m128i a);
VPBROADCASTW __m128i _mm_maskz_broadcastw_epi16( __mmask8 k, __m128i a);
VBROADCASTI32x2 __m512i _mm512_broadcast_i32x2( __m128i a);



VBROADCASTI32x2 __m512i _mm512_mask_broadcast_i32x2( __m512i s, __mmask16 k, __m128i a);
VBROADCASTI32x2 __m512i _mm512_maskz_broadcast_i32x2( __mmask16 k, __m128i a);
VBROADCASTI32x2 __m256i _mm256_broadcast_i32x2( __m128i a);
VBROADCASTI32x2 __m256i _mm256_mask_broadcast_i32x2( __m256i s, __mmask8 k, __m128i a);
VBROADCASTI32x2 __m256i _mm256_maskz_broadcast_i32x2( __mmask8 k, __m128i a);
VBROADCASTI32x2 __m128i _mm_broadcast_i32x2( __m128i a);
VBROADCASTI32x2 __m128i _mm_mask_broadcast_i32x2( __m128i s, __mmask8 k, __m128i a);
VBROADCASTI32x2 __m128i _mm_maskz_broadcast_i32x2( __mmask8 k, __m128i a);
VBROADCASTI32x4 __m512i _mm512_broadcast_i32x4( __m128i a);
VBROADCASTI32x4 __m512i _mm512_mask_broadcast_i32x4( __m512i s, __mmask16 k, __m128i a);
VBROADCASTI32x4 __m512i _mm512_maskz_broadcast_i32x4( __mmask16 k, __m128i a);
VBROADCASTI32x4 __m256i _mm256_broadcast_i32x4( __m128i a);
VBROADCASTI32x4 __m256i _mm256_mask_broadcast_i32x4( __m256i s, __mmask8 k, __m128i a);
VBROADCASTI32x4 __m256i _mm256_maskz_broadcast_i32x4( __mmask8 k, __m128i a);
VBROADCASTI32x8 __m512i _mm512_broadcast_i32x8( __m256i a);
VBROADCASTI32x8 __m512i _mm512_mask_broadcast_i32x8( __m512i s, __mmask16 k, __m256i a);
VBROADCASTI32x8 __m512i _mm512_maskz_broadcast_i32x8( __mmask16 k, __m256i a);
VBROADCASTI64x2 __m512i _mm512_broadcast_i64x2( __m128i a);
VBROADCASTI64x2 __m512i _mm512_mask_broadcast_i64x2( __m512i s, __mmask8 k, __m128i a);
VBROADCASTI64x2 __m512i _mm512_maskz_broadcast_i64x2( __mmask8 k, __m128i a);
VBROADCASTI64x2 __m256i _mm256_broadcast_i64x2( __m128i a);
VBROADCASTI64x2 __m256i _mm256_mask_broadcast_i64x2( __m256i s, __mmask8 k, __m128i a);
VBROADCASTI64x2 __m256i _mm256_maskz_broadcast_i64x2( __mmask8 k, __m128i a);
VBROADCASTI64x4 __m512i _mm512_broadcast_i64x4( __m256i a);
VBROADCASTI64x4 __m512i _mm512_mask_broadcast_i64x4( __m512i s, __mmask8 k, __m256i a);
VBROADCASTI64x4 __m512i _mm512_maskz_broadcast_i64x4( __mmask8 k, __m256i a);


SIMD Floating-Point Exceptions

None


Other Exceptions

VEX-encoded instructions, see Exceptions Type 6;

EVEX-encoded instructions, syntax with reg/mem operand, see Exceptions Type E6.

#UD If VEX.L = 0 for VBROADCASTI128.

If EVEX.L'L = 0 for VBROADCASTI32X4/VBROADCASTI64X2.

If EVEX.L'L < 10b for VBROADCASTI32X8/VBROADCASTI64X4.


VPBROADCASTM—Broadcast Mask to Vector Register

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.F3.0F38.W1 2A /r VPBROADCASTMB2Q xmm1, k1 | RM | V/V | AVX512VL AVX512CD | Broadcast low byte value in k1 to two locations in xmm1.
EVEX.256.F3.0F38.W1 2A /r VPBROADCASTMB2Q ymm1, k1 | RM | V/V | AVX512VL AVX512CD | Broadcast low byte value in k1 to four locations in ymm1.
EVEX.512.F3.0F38.W1 2A /r VPBROADCASTMB2Q zmm1, k1 | RM | V/V | AVX512CD | Broadcast low byte value in k1 to eight locations in zmm1.
EVEX.128.F3.0F38.W0 3A /r VPBROADCASTMW2D xmm1, k1 | RM | V/V | AVX512VL AVX512CD | Broadcast low word value in k1 to four locations in xmm1.
EVEX.256.F3.0F38.W0 3A /r VPBROADCASTMW2D ymm1, k1 | RM | V/V | AVX512VL AVX512CD | Broadcast low word value in k1 to eight locations in ymm1.
EVEX.512.F3.0F38.W0 3A /r VPBROADCASTMW2D zmm1, k1 | RM | V/V | AVX512CD | Broadcast low word value in k1 to sixteen locations in zmm1.


Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Description

Broadcasts the zero-extended 64/32 bit value of the low byte/word of the source operand (the second operand) to each 64/32 bit element of the destination operand (the first operand). The source operand is an opmask register. The destination operand is a ZMM register (EVEX.512), YMM register (EVEX.256), or XMM register (EVEX.128).

EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.

Operation

VPBROADCASTMB2Q
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j*64
    DEST[i+63:i] ← ZeroExtend(SRC[7:0])
ENDFOR
DEST[MAXVL-1:VL] ← 0


VPBROADCASTMW2D
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j*32
    DEST[i+31:i] ← ZeroExtend(SRC[15:0])
ENDFOR
DEST[MAXVL-1:VL] ← 0



Intel C/C++ Compiler Intrinsic Equivalent

VPBROADCASTMB2Q __m512i _mm512_broadcastmb_epi64( __mmask8);
VPBROADCASTMW2D __m512i _mm512_broadcastmw_epi32( __mmask16);
VPBROADCASTMB2Q __m256i _mm256_broadcastmb_epi64( __mmask8);
VPBROADCASTMW2D __m256i _mm256_broadcastmw_epi32( __mmask8);
VPBROADCASTMB2Q __m128i _mm_broadcastmb_epi64( __mmask8);
VPBROADCASTMW2D __m128i _mm_broadcastmw_epi32( __mmask8);

SIMD Floating-Point Exceptions

None


Other Exceptions

EVEX-encoded instruction, see Exceptions Type E6NF.


VPCMPB/VPCMPUB—Compare Packed Byte Values Into Mask

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.NDS.128.66.0F3A.W0 3F /r ib VPCMPB k1 {k2}, xmm2, xmm3/m128, imm8 | A | V/V | AVX512VL AVX512BW | Compare packed signed byte values in xmm3/m128 and xmm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.
EVEX.NDS.256.66.0F3A.W0 3F /r ib VPCMPB k1 {k2}, ymm2, ymm3/m256, imm8 | A | V/V | AVX512VL AVX512BW | Compare packed signed byte values in ymm3/m256 and ymm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.
EVEX.NDS.512.66.0F3A.W0 3F /r ib VPCMPB k1 {k2}, zmm2, zmm3/m512, imm8 | A | V/V | AVX512BW | Compare packed signed byte values in zmm3/m512 and zmm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.
EVEX.NDS.128.66.0F3A.W0 3E /r ib VPCMPUB k1 {k2}, xmm2, xmm3/m128, imm8 | A | V/V | AVX512VL AVX512BW | Compare packed unsigned byte values in xmm3/m128 and xmm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.
EVEX.NDS.256.66.0F3A.W0 3E /r ib VPCMPUB k1 {k2}, ymm2, ymm3/m256, imm8 | A | V/V | AVX512VL AVX512BW | Compare packed unsigned byte values in ymm3/m256 and ymm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.
EVEX.NDS.512.66.0F3A.W0 3E /r ib VPCMPUB k1 {k2}, zmm2, zmm3/m512, imm8 | A | V/V | AVX512BW | Compare packed unsigned byte values in zmm3/m512 and zmm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | Full Mem | ModRM:reg (w) | vvvv (r) | ModRM:r/m (r) | NA

Description

Performs a SIMD compare of the packed byte values in the second source operand and the first source operand and returns the results of the comparison to the mask destination operand. The comparison predicate operand (immediate byte) specifies the type of comparison performed on each pair of packed values in the two source operands. The result of each comparison is a single mask bit result of 1 (comparison true) or 0 (comparison false).

VPCMPB performs a comparison between pairs of signed byte values. VPCMPUB performs a comparison between pairs of unsigned byte values.

The first source operand (second operand) is a ZMM/YMM/XMM register. The second source operand can be a ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand (first operand) is a mask register k1. Up to 64/32/16 comparisons are performed with results written to the destination operand under the writemask k2.



The comparison predicate operand is an 8-bit immediate: bits 2:0 define the type of comparison to be performed. Bits 3 through 7 of the immediate are reserved. The compiler can implement the pseudo-op mnemonics listed in Table 5-8.



Table 5-8. Pseudo-Op and VPCMP* Implementation
Pseudo-Op | PCMPM Implementation
VPCMPEQ* reg1, reg2, reg3 | VPCMP* reg1, reg2, reg3, 0
VPCMPLT* reg1, reg2, reg3 | VPCMP* reg1, reg2, reg3, 1
VPCMPLE* reg1, reg2, reg3 | VPCMP* reg1, reg2, reg3, 2
VPCMPNEQ* reg1, reg2, reg3 | VPCMP* reg1, reg2, reg3, 4
VPCMPNLT* reg1, reg2, reg3 | VPCMP* reg1, reg2, reg3, 5
VPCMPNLE* reg1, reg2, reg3 | VPCMP* reg1, reg2, reg3, 6


Operation

CASE (COMPARISON PREDICATE) OF
    0: OP ← EQ;
    1: OP ← LT;
    2: OP ← LE;
    3: OP ← FALSE;
    4: OP ← NEQ;
    5: OP ← NLT;
    6: OP ← NLE;
    7: OP ← TRUE;
ESAC;


VPCMPB (EVEX encoded versions)
(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j ← 0 TO KL-1
    i ← j * 8
    IF k2[j] OR *no writemask*
        THEN
            CMP ← SRC1[i+7:i] OP SRC2[i+7:i];
            IF CMP = TRUE
                THEN DEST[j] ← 1;
                ELSE DEST[j] ← 0; FI;
        ELSE DEST[j] ← 0 ; zeroing-masking only
    FI;
ENDFOR
DEST[MAX_KL-1:KL] ← 0


VPCMPUB (EVEX encoded versions)
(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j ← 0 TO KL-1
    i ← j * 8
    IF k2[j] OR *no writemask*
        THEN
            CMP ← SRC1[i+7:i] OP SRC2[i+7:i];
            IF CMP = TRUE
                THEN DEST[j] ← 1;
                ELSE DEST[j] ← 0; FI;
        ELSE DEST[j] ← 0 ; zeroing-masking only
    FI;
ENDFOR
DEST[MAX_KL-1:KL] ← 0


Intel C/C++ Compiler Intrinsic Equivalent

VPCMPB __mmask64 _mm512_cmp_epi8_mask( __m512i a, __m512i b, int cmp);
VPCMPB __mmask64 _mm512_mask_cmp_epi8_mask( __mmask64 m, __m512i a, __m512i b, int cmp);
VPCMPB __mmask32 _mm256_cmp_epi8_mask( __m256i a, __m256i b, int cmp);
VPCMPB __mmask32 _mm256_mask_cmp_epi8_mask( __mmask32 m, __m256i a, __m256i b, int cmp);
VPCMPB __mmask16 _mm_cmp_epi8_mask( __m128i a, __m128i b, int cmp);
VPCMPB __mmask16 _mm_mask_cmp_epi8_mask( __mmask16 m, __m128i a, __m128i b, int cmp);
VPCMPB __mmask64 _mm512_cmp[eq|ge|gt|le|lt|neq]_epi8_mask( __m512i a, __m512i b);
VPCMPB __mmask64 _mm512_mask_cmp[eq|ge|gt|le|lt|neq]_epi8_mask( __mmask64 m, __m512i a, __m512i b);
VPCMPB __mmask32 _mm256_cmp[eq|ge|gt|le|lt|neq]_epi8_mask( __m256i a, __m256i b);
VPCMPB __mmask32 _mm256_mask_cmp[eq|ge|gt|le|lt|neq]_epi8_mask( __mmask32 m, __m256i a, __m256i b);
VPCMPB __mmask16 _mm_cmp[eq|ge|gt|le|lt|neq]_epi8_mask( __m128i a, __m128i b);
VPCMPB __mmask16 _mm_mask_cmp[eq|ge|gt|le|lt|neq]_epi8_mask( __mmask16 m, __m128i a, __m128i b);
VPCMPUB __mmask64 _mm512_cmp_epu8_mask( __m512i a, __m512i b, int cmp);
VPCMPUB __mmask64 _mm512_mask_cmp_epu8_mask( __mmask64 m, __m512i a, __m512i b, int cmp);
VPCMPUB __mmask32 _mm256_cmp_epu8_mask( __m256i a, __m256i b, int cmp);
VPCMPUB __mmask32 _mm256_mask_cmp_epu8_mask( __mmask32 m, __m256i a, __m256i b, int cmp);
VPCMPUB __mmask16 _mm_cmp_epu8_mask( __m128i a, __m128i b, int cmp);
VPCMPUB __mmask16 _mm_mask_cmp_epu8_mask( __mmask16 m, __m128i a, __m128i b, int cmp);
VPCMPUB __mmask64 _mm512_cmp[eq|ge|gt|le|lt|neq]_epu8_mask( __m512i a, __m512i b);
VPCMPUB __mmask64 _mm512_mask_cmp[eq|ge|gt|le|lt|neq]_epu8_mask( __mmask64 m, __m512i a, __m512i b);
VPCMPUB __mmask32 _mm256_cmp[eq|ge|gt|le|lt|neq]_epu8_mask( __m256i a, __m256i b);
VPCMPUB __mmask32 _mm256_mask_cmp[eq|ge|gt|le|lt|neq]_epu8_mask( __mmask32 m, __m256i a, __m256i b);
VPCMPUB __mmask16 _mm_cmp[eq|ge|gt|le|lt|neq]_epu8_mask( __m128i a, __m128i b);
VPCMPUB __mmask16 _mm_mask_cmp[eq|ge|gt|le|lt|neq]_epu8_mask( __mmask16 m, __m128i a, __m128i b);


SIMD Floating-Point Exceptions

None


Other Exceptions

EVEX-encoded instruction, see Exceptions Type E4.nb.


VPCMPD/VPCMPUD—Compare Packed Integer Values into Mask

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.NDS.128.66.0F3A.W0 1F /r ib VPCMPD k1 {k2}, xmm2, xmm3/m128/m32bcst, imm8 | A | V/V | AVX512VL AVX512F | Compare packed signed doubleword integer values in xmm3/m128/m32bcst and xmm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.
EVEX.NDS.256.66.0F3A.W0 1F /r ib VPCMPD k1 {k2}, ymm2, ymm3/m256/m32bcst, imm8 | A | V/V | AVX512VL AVX512F | Compare packed signed doubleword integer values in ymm3/m256/m32bcst and ymm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.
EVEX.NDS.512.66.0F3A.W0 1F /r ib VPCMPD k1 {k2}, zmm2, zmm3/m512/m32bcst, imm8 | A | V/V | AVX512F | Compare packed signed doubleword integer values in zmm2 and zmm3/m512/m32bcst using bits 2:0 of imm8 as a comparison predicate. The comparison results are written to the destination k1 under writemask k2.
EVEX.NDS.128.66.0F3A.W0 1E /r ib VPCMPUD k1 {k2}, xmm2, xmm3/m128/m32bcst, imm8 | A | V/V | AVX512VL AVX512F | Compare packed unsigned doubleword integer values in xmm3/m128/m32bcst and xmm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.
EVEX.NDS.256.66.0F3A.W0 1E /r ib VPCMPUD k1 {k2}, ymm2, ymm3/m256/m32bcst, imm8 | A | V/V | AVX512VL AVX512F | Compare packed unsigned doubleword integer values in ymm3/m256/m32bcst and ymm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.
EVEX.NDS.512.66.0F3A.W0 1E /r ib VPCMPUD k1 {k2}, zmm2, zmm3/m512/m32bcst, imm8 | A | V/V | AVX512F | Compare packed unsigned doubleword integer values in zmm2 and zmm3/m512/m32bcst using bits 2:0 of imm8 as a comparison predicate. The comparison results are written to the destination k1 under writemask k2.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | Full | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | Imm8

Description

Performs a SIMD compare of the packed integer values in the second source operand and the first source operand and returns the results of the comparison to the mask destination operand. The comparison predicate operand (immediate byte) specifies the type of comparison performed on each pair of packed values in the two source operands. The result of each comparison is a single mask bit result of 1 (comparison true) or 0 (comparison false).

VPCMPD/VPCMPUD performs a comparison between pairs of signed/unsigned doubleword integer values. The first source operand (second operand) is a ZMM/YMM/XMM register. The second source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512-bit vector broadcasted from a 32-bit memory location. The destination operand (first operand) is a mask register k1. Up to 16/8/4 comparisons are performed with results written to the destination operand under the writemask k2.

The comparison predicate operand is an 8-bit immediate: bits 2:0 define the type of comparison to be performed. Bits 3 through 7 of the immediate are reserved. The compiler can implement the pseudo-op mnemonics listed in Table 5-8.



Operation

CASE (COMPARISON PREDICATE) OF
    0: OP ← EQ;
    1: OP ← LT;
    2: OP ← LE;
    3: OP ← FALSE;
    4: OP ← NEQ;
    5: OP ← NLT;
    6: OP ← NLE;
    7: OP ← TRUE;
ESAC;


VPCMPD (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k2[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC2 *is memory*)
                THEN CMP ← SRC1[i+31:i] OP SRC2[31:0];
                ELSE CMP ← SRC1[i+31:i] OP SRC2[i+31:i];
            FI;
            IF CMP = TRUE
                THEN DEST[j] ← 1;
                ELSE DEST[j] ← 0; FI;
        ELSE DEST[j] ← 0 ; zeroing-masking only
    FI;
ENDFOR
DEST[MAX_KL-1:KL] ← 0


VPCMPUD (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k2[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC2 *is memory*)
                THEN CMP ← SRC1[i+31:i] OP SRC2[31:0];
                ELSE CMP ← SRC1[i+31:i] OP SRC2[i+31:i];
            FI;
            IF CMP = TRUE
                THEN DEST[j] ← 1;
                ELSE DEST[j] ← 0; FI;
        ELSE DEST[j] ← 0 ; zeroing-masking only
    FI;
ENDFOR
DEST[MAX_KL-1:KL] ← 0



Intel C/C++ Compiler Intrinsic Equivalent

VPCMPD __mmask16 _mm512_cmp_epi32_mask( __m512i a, __m512i b, int imm);
VPCMPD __mmask16 _mm512_mask_cmp_epi32_mask( __mmask16 k, __m512i a, __m512i b, int imm);
VPCMPD __mmask16 _mm512_cmp[eq|ge|gt|le|lt|neq]_epi32_mask( __m512i a, __m512i b);
VPCMPD __mmask16 _mm512_mask_cmp[eq|ge|gt|le|lt|neq]_epi32_mask( __mmask16 k, __m512i a, __m512i b);
VPCMPUD __mmask16 _mm512_cmp_epu32_mask( __m512i a, __m512i b, int imm);
VPCMPUD __mmask16 _mm512_mask_cmp_epu32_mask( __mmask16 k, __m512i a, __m512i b, int imm);
VPCMPUD __mmask16 _mm512_cmp[eq|ge|gt|le|lt|neq]_epu32_mask( __m512i a, __m512i b);
VPCMPUD __mmask16 _mm512_mask_cmp[eq|ge|gt|le|lt|neq]_epu32_mask( __mmask16 k, __m512i a, __m512i b);
VPCMPD __mmask8 _mm256_cmp_epi32_mask( __m256i a, __m256i b, int imm);
VPCMPD __mmask8 _mm256_mask_cmp_epi32_mask( __mmask8 k, __m256i a, __m256i b, int imm);
VPCMPD __mmask8 _mm256_cmp[eq|ge|gt|le|lt|neq]_epi32_mask( __m256i a, __m256i b);
VPCMPD __mmask8 _mm256_mask_cmp[eq|ge|gt|le|lt|neq]_epi32_mask( __mmask8 k, __m256i a, __m256i b);
VPCMPUD __mmask8 _mm256_cmp_epu32_mask( __m256i a, __m256i b, int imm);
VPCMPUD __mmask8 _mm256_mask_cmp_epu32_mask( __mmask8 k, __m256i a, __m256i b, int imm);
VPCMPUD __mmask8 _mm256_cmp[eq|ge|gt|le|lt|neq]_epu32_mask( __m256i a, __m256i b);
VPCMPUD __mmask8 _mm256_mask_cmp[eq|ge|gt|le|lt|neq]_epu32_mask( __mmask8 k, __m256i a, __m256i b);
VPCMPD __mmask8 _mm_cmp_epi32_mask( __m128i a, __m128i b, int imm);
VPCMPD __mmask8 _mm_mask_cmp_epi32_mask( __mmask8 k, __m128i a, __m128i b, int imm);
VPCMPD __mmask8 _mm_cmp[eq|ge|gt|le|lt|neq]_epi32_mask( __m128i a, __m128i b);
VPCMPD __mmask8 _mm_mask_cmp[eq|ge|gt|le|lt|neq]_epi32_mask( __mmask8 k, __m128i a, __m128i b);
VPCMPUD __mmask8 _mm_cmp_epu32_mask( __m128i a, __m128i b, int imm);
VPCMPUD __mmask8 _mm_mask_cmp_epu32_mask( __mmask8 k, __m128i a, __m128i b, int imm);
VPCMPUD __mmask8 _mm_cmp[eq|ge|gt|le|lt|neq]_epu32_mask( __m128i a, __m128i b);
VPCMPUD __mmask8 _mm_mask_cmp[eq|ge|gt|le|lt|neq]_epu32_mask( __mmask8 k, __m128i a, __m128i b);


SIMD Floating-Point Exceptions

None


Other Exceptions

EVEX-encoded instruction, see Exceptions Type E4.


VPCMPQ/VPCMPUQ—Compare Packed Integer Values into Mask

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.NDS.128.66.0F3A.W1 1F /r ib VPCMPQ k1 {k2}, xmm2, xmm3/m128/m64bcst, imm8 | A | V/V | AVX512VL AVX512F | Compare packed signed quadword integer values in xmm3/m128/m64bcst and xmm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.
EVEX.NDS.256.66.0F3A.W1 1F /r ib VPCMPQ k1 {k2}, ymm2, ymm3/m256/m64bcst, imm8 | A | V/V | AVX512VL AVX512F | Compare packed signed quadword integer values in ymm3/m256/m64bcst and ymm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.
EVEX.NDS.512.66.0F3A.W1 1F /r ib VPCMPQ k1 {k2}, zmm2, zmm3/m512/m64bcst, imm8 | A | V/V | AVX512F | Compare packed signed quadword integer values in zmm3/m512/m64bcst and zmm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.
EVEX.NDS.128.66.0F3A.W1 1E /r ib VPCMPUQ k1 {k2}, xmm2, xmm3/m128/m64bcst, imm8 | A | V/V | AVX512VL AVX512F | Compare packed unsigned quadword integer values in xmm3/m128/m64bcst and xmm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.
EVEX.NDS.256.66.0F3A.W1 1E /r ib VPCMPUQ k1 {k2}, ymm2, ymm3/m256/m64bcst, imm8 | A | V/V | AVX512VL AVX512F | Compare packed unsigned quadword integer values in ymm3/m256/m64bcst and ymm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.
EVEX.NDS.512.66.0F3A.W1 1E /r ib VPCMPUQ k1 {k2}, zmm2, zmm3/m512/m64bcst, imm8 | A | V/V | AVX512F | Compare packed unsigned quadword integer values in zmm3/m512/m64bcst and zmm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | Full | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | Imm8

Description

Performs a SIMD compare of the packed integer values in the second source operand and the first source operand and returns the results of the comparison to the mask destination operand. The comparison predicate operand (immediate byte) specifies the type of comparison performed on each pair of packed values in the two source operands. The result of each comparison is a single mask bit result of 1 (comparison true) or 0 (comparison false).

VPCMPQ/VPCMPUQ performs a comparison between pairs of signed/unsigned quadword integer values.

The first source operand (second operand) is a ZMM/YMM/XMM register. The second source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512-bit vector broadcasted from a 64-bit memory location. The destination operand (first operand) is a mask register k1. Up to 8/4/2 comparisons are performed with results written to the destination operand under the writemask k2.

The comparison predicate operand is an 8-bit immediate: bits 2:0 define the type of comparison to be performed. Bits 3 through 7 of the immediate are reserved. The compiler can implement the pseudo-op mnemonics listed in Table 5-8.



Operation

CASE (COMPARISON PREDICATE) OF
    0: OP ← EQ;
    1: OP ← LT;
    2: OP ← LE;
    3: OP ← FALSE;
    4: OP ← NEQ;
    5: OP ← NLT;
    6: OP ← NLE;
    7: OP ← TRUE;
ESAC;


VPCMPQ (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k2[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC2 *is memory*)
                THEN CMP ← SRC1[i+63:i] OP SRC2[63:0];
                ELSE CMP ← SRC1[i+63:i] OP SRC2[i+63:i];
            FI;
            IF CMP = TRUE
                THEN DEST[j] ← 1;
                ELSE DEST[j] ← 0; FI;
        ELSE DEST[j] ← 0 ; zeroing-masking only
    FI;
ENDFOR
DEST[MAX_KL-1:KL] ← 0


VPCMPUQ (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k2[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC2 *is memory*)
                THEN CMP ← SRC1[i+63:i] OP SRC2[63:0];
                ELSE CMP ← SRC1[i+63:i] OP SRC2[i+63:i];
            FI;
            IF CMP = TRUE
                THEN DEST[j] ← 1;
                ELSE DEST[j] ← 0; FI;
        ELSE DEST[j] ← 0 ; zeroing-masking only
    FI;
ENDFOR
DEST[MAX_KL-1:KL] ← 0



Intel C/C++ Compiler Intrinsic Equivalent

VPCMPQ __mmask8 _mm512_cmp_epi64_mask( __m512i a, __m512i b, int imm);
VPCMPQ __mmask8 _mm512_mask_cmp_epi64_mask( __mmask8 k, __m512i a, __m512i b, int imm);
VPCMPQ __mmask8 _mm512_cmp[eq|ge|gt|le|lt|neq]_epi64_mask( __m512i a, __m512i b);
VPCMPQ __mmask8 _mm512_mask_cmp[eq|ge|gt|le|lt|neq]_epi64_mask( __mmask8 k, __m512i a, __m512i b);
VPCMPUQ __mmask8 _mm512_cmp_epu64_mask( __m512i a, __m512i b, int imm);
VPCMPUQ __mmask8 _mm512_mask_cmp_epu64_mask( __mmask8 k, __m512i a, __m512i b, int imm);
VPCMPUQ __mmask8 _mm512_cmp[eq|ge|gt|le|lt|neq]_epu64_mask( __m512i a, __m512i b);
VPCMPUQ __mmask8 _mm512_mask_cmp[eq|ge|gt|le|lt|neq]_epu64_mask( __mmask8 k, __m512i a, __m512i b);
VPCMPQ __mmask8 _mm256_cmp_epi64_mask( __m256i a, __m256i b, int imm);
VPCMPQ __mmask8 _mm256_mask_cmp_epi64_mask( __mmask8 k, __m256i a, __m256i b, int imm);
VPCMPQ __mmask8 _mm256_cmp[eq|ge|gt|le|lt|neq]_epi64_mask( __m256i a, __m256i b);
VPCMPQ __mmask8 _mm256_mask_cmp[eq|ge|gt|le|lt|neq]_epi64_mask( __mmask8 k, __m256i a, __m256i b);
VPCMPUQ __mmask8 _mm256_cmp_epu64_mask( __m256i a, __m256i b, int imm);
VPCMPUQ __mmask8 _mm256_mask_cmp_epu64_mask( __mmask8 k, __m256i a, __m256i b, int imm);
VPCMPUQ __mmask8 _mm256_cmp[eq|ge|gt|le|lt|neq]_epu64_mask( __m256i a, __m256i b);
VPCMPUQ __mmask8 _mm256_mask_cmp[eq|ge|gt|le|lt|neq]_epu64_mask( __mmask8 k, __m256i a, __m256i b);
VPCMPQ __mmask8 _mm_cmp_epi64_mask( __m128i a, __m128i b, int imm);
VPCMPQ __mmask8 _mm_mask_cmp_epi64_mask( __mmask8 k, __m128i a, __m128i b, int imm);
VPCMPQ __mmask8 _mm_cmp[eq|ge|gt|le|lt|neq]_epi64_mask( __m128i a, __m128i b);
VPCMPQ __mmask8 _mm_mask_cmp[eq|ge|gt|le|lt|neq]_epi64_mask( __mmask8 k, __m128i a, __m128i b);
VPCMPUQ __mmask8 _mm_cmp_epu64_mask( __m128i a, __m128i b, int imm);
VPCMPUQ __mmask8 _mm_mask_cmp_epu64_mask( __mmask8 k, __m128i a, __m128i b, int imm);
VPCMPUQ __mmask8 _mm_cmp[eq|ge|gt|le|lt|neq]_epu64_mask( __m128i a, __m128i b);
VPCMPUQ __mmask8 _mm_mask_cmp[eq|ge|gt|le|lt|neq]_epu64_mask( __mmask8 k, __m128i a, __m128i b);


SIMD Floating-Point Exceptions

None


Other Exceptions

EVEX-encoded instruction, see Exceptions Type E4.


VPCMPW/VPCMPUW—Compare Packed Word Values Into Mask

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.NDS.128.66.0F3A.W1 3F /r ib VPCMPW k1 {k2}, xmm2, xmm3/m128, imm8 | A | V/V | AVX512VL AVX512BW | Compare packed signed word integers in xmm3/m128 and xmm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.
EVEX.NDS.256.66.0F3A.W1 3F /r ib VPCMPW k1 {k2}, ymm2, ymm3/m256, imm8 | A | V/V | AVX512VL AVX512BW | Compare packed signed word integers in ymm3/m256 and ymm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.
EVEX.NDS.512.66.0F3A.W1 3F /r ib VPCMPW k1 {k2}, zmm2, zmm3/m512, imm8 | A | V/V | AVX512BW | Compare packed signed word integers in zmm3/m512 and zmm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.
EVEX.NDS.128.66.0F3A.W1 3E /r ib VPCMPUW k1 {k2}, xmm2, xmm3/m128, imm8 | A | V/V | AVX512VL AVX512BW | Compare packed unsigned word integers in xmm3/m128 and xmm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.
EVEX.NDS.256.66.0F3A.W1 3E /r ib VPCMPUW k1 {k2}, ymm2, ymm3/m256, imm8 | A | V/V | AVX512VL AVX512BW | Compare packed unsigned word integers in ymm3/m256 and ymm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.
EVEX.NDS.512.66.0F3A.W1 3E /r ib VPCMPUW k1 {k2}, zmm2, zmm3/m512, imm8 | A | V/V | AVX512BW | Compare packed unsigned word integers in zmm3/m512 and zmm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | Full Mem | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA

Description

Performs a SIMD compare of the packed integer word in the second source operand and the first source operand and returns the results of the comparison to the mask destination operand. The comparison predicate operand (immediate byte) specifies the type of comparison performed on each pair of packed values in the two source operands. The result of each comparison is a single mask bit result of 1 (comparison true) or 0 (comparison false).

VPCMPW performs a comparison between pairs of signed word values. VPCMPUW performs a comparison between pairs of unsigned word values.

The first source operand (second operand) is a ZMM/YMM/XMM register. The second source operand can be a ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand (first operand) is a mask register k1. Up to 32/16/8 comparisons are performed with results written to the destination operand under the writemask k2.

The comparison predicate operand is an 8-bit immediate: bits 2:0 define the type of comparison to be performed. Bits 7:3 of the immediate are reserved. The compiler can implement the pseudo-op mnemonics listed in Table 5-8.
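As a cross-check of the predicate encoding described above, the following is a minimal scalar C sketch of one signed-word lane and the writemask behavior; `vpcmpw_lane` and `vpcmpw_mask` are illustrative names, not part of any Intel API, and zeroing of inactive lanes is assumed.

```c
#include <stdint.h>

/* Hypothetical model of one VPCMPW lane: imm8 bits 2:0 select the predicate. */
static int vpcmpw_lane(int16_t a, int16_t b, int imm3) {
    switch (imm3 & 7) {
        case 0: return a == b;     /* EQ    */
        case 1: return a <  b;     /* LT    */
        case 2: return a <= b;     /* LE    */
        case 3: return 0;          /* FALSE */
        case 4: return a != b;     /* NEQ   */
        case 5: return !(a < b);   /* NLT   */
        case 6: return !(a <= b);  /* NLE   */
        default: return 1;         /* TRUE  */
    }
}

/* Build a kl-lane result mask; lanes masked off by k2 are zeroed. */
static uint32_t vpcmpw_mask(const int16_t *src1, const int16_t *src2,
                            int kl, uint32_t k2, int imm3) {
    uint32_t k1 = 0;
    for (int j = 0; j < kl; j++)
        if ((k2 >> j) & 1)
            k1 |= (uint32_t)vpcmpw_lane(src1[j], src2[j], imm3) << j;
    return k1;
}
```

With predicate 1 (LT), only the lanes where the first source is less than the second set their mask bit, and the writemask k2 gates which lanes can be set at all.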



Operation

CASE (COMPARISON PREDICATE) OF
    0: OP ← EQ;
    1: OP ← LT;
    2: OP ← LE;
    3: OP ← FALSE;
    4: OP ← NEQ;
    5: OP ← NLT;
    6: OP ← NLE;
    7: OP ← TRUE;
ESAC;


VPCMPW (EVEX encoded versions)
(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j ← 0 TO KL-1
    i ← j * 16
    IF k2[j] OR *no writemask* THEN
        CMP ← SRC1[i+15:i] OP SRC2[i+15:i];
        IF CMP = TRUE
            THEN DEST[j] ← 1;
            ELSE DEST[j] ← 0;
        FI;
    ELSE DEST[j] ← 0 ; zeroing-masking only
    FI;
ENDFOR
DEST[MAX_KL-1:KL] ← 0


VPCMPUW (EVEX encoded versions)
(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j ← 0 TO KL-1
    i ← j * 16
    IF k2[j] OR *no writemask* THEN
        CMP ← SRC1[i+15:i] OP SRC2[i+15:i];
        IF CMP = TRUE
            THEN DEST[j] ← 1;
            ELSE DEST[j] ← 0;
        FI;
    ELSE DEST[j] ← 0 ; zeroing-masking only
    FI;
ENDFOR
DEST[MAX_KL-1:KL] ← 0



Intel C/C++ Compiler Intrinsic Equivalent

VPCMPW __mmask32 _mm512_cmp_epi16_mask( __m512i a, __m512i b, int cmp);
VPCMPW __mmask32 _mm512_mask_cmp_epi16_mask( __mmask32 m, __m512i a, __m512i b, int cmp);
VPCMPW __mmask16 _mm256_cmp_epi16_mask( __m256i a, __m256i b, int cmp);
VPCMPW __mmask16 _mm256_mask_cmp_epi16_mask( __mmask16 m, __m256i a, __m256i b, int cmp);
VPCMPW __mmask8 _mm_cmp_epi16_mask( __m128i a, __m128i b, int cmp);
VPCMPW __mmask8 _mm_mask_cmp_epi16_mask( __mmask8 m, __m128i a, __m128i b, int cmp);
VPCMPW __mmask32 _mm512_cmp[eq|ge|gt|le|lt|neq]_epi16_mask( __m512i a, __m512i b);
VPCMPW __mmask32 _mm512_mask_cmp[eq|ge|gt|le|lt|neq]_epi16_mask( __mmask32 m, __m512i a, __m512i b);
VPCMPW __mmask16 _mm256_cmp[eq|ge|gt|le|lt|neq]_epi16_mask( __m256i a, __m256i b);
VPCMPW __mmask16 _mm256_mask_cmp[eq|ge|gt|le|lt|neq]_epi16_mask( __mmask16 m, __m256i a, __m256i b);
VPCMPW __mmask8 _mm_cmp[eq|ge|gt|le|lt|neq]_epi16_mask( __m128i a, __m128i b);
VPCMPW __mmask8 _mm_mask_cmp[eq|ge|gt|le|lt|neq]_epi16_mask( __mmask8 m, __m128i a, __m128i b);
VPCMPUW __mmask32 _mm512_cmp_epu16_mask( __m512i a, __m512i b, int cmp);
VPCMPUW __mmask32 _mm512_mask_cmp_epu16_mask( __mmask32 m, __m512i a, __m512i b, int cmp);
VPCMPUW __mmask16 _mm256_cmp_epu16_mask( __m256i a, __m256i b, int cmp);
VPCMPUW __mmask16 _mm256_mask_cmp_epu16_mask( __mmask16 m, __m256i a, __m256i b, int cmp);
VPCMPUW __mmask8 _mm_cmp_epu16_mask( __m128i a, __m128i b, int cmp);
VPCMPUW __mmask8 _mm_mask_cmp_epu16_mask( __mmask8 m, __m128i a, __m128i b, int cmp);
VPCMPUW __mmask32 _mm512_cmp[eq|ge|gt|le|lt|neq]_epu16_mask( __m512i a, __m512i b);
VPCMPUW __mmask32 _mm512_mask_cmp[eq|ge|gt|le|lt|neq]_epu16_mask( __mmask32 m, __m512i a, __m512i b);
VPCMPUW __mmask16 _mm256_cmp[eq|ge|gt|le|lt|neq]_epu16_mask( __m256i a, __m256i b);
VPCMPUW __mmask16 _mm256_mask_cmp[eq|ge|gt|le|lt|neq]_epu16_mask( __mmask16 m, __m256i a, __m256i b);
VPCMPUW __mmask8 _mm_cmp[eq|ge|gt|le|lt|neq]_epu16_mask( __m128i a, __m128i b);
VPCMPUW __mmask8 _mm_mask_cmp[eq|ge|gt|le|lt|neq]_epu16_mask( __mmask8 m, __m128i a, __m128i b);


SIMD Floating-Point Exceptions

None


Other Exceptions

EVEX-encoded instruction, see Exceptions Type E4.nb.


VPCOMPRESSD—Store Sparse Packed Doubleword Integer Values into Dense Memory/Register

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F38.W0 8B /r VPCOMPRESSD xmm1/m128 {k1}{z}, xmm2 | A | V/V | AVX512VL AVX512F | Compress packed doubleword integer values from xmm2 to xmm1/m128 using controlmask k1.
EVEX.256.66.0F38.W0 8B /r VPCOMPRESSD ymm1/m256 {k1}{z}, ymm2 | A | V/V | AVX512VL AVX512F | Compress packed doubleword integer values from ymm2 to ymm1/m256 using controlmask k1.
EVEX.512.66.0F38.W0 8B /r VPCOMPRESSD zmm1/m512 {k1}{z}, zmm2 | A | V/V | AVX512F | Compress packed doubleword integer values from zmm2 to zmm1/m512 using controlmask k1.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | Tuple1 Scalar | ModRM:r/m (w) | ModRM:reg (r) | NA | NA

Description

Compress (store) up to 16/8/4 doubleword integer values from the source operand (second operand) to the destination operand (first operand). The source operand is a ZMM/YMM/XMM register, the destination operand can be a ZMM/YMM/XMM register or a 512/256/128-bit memory location.

The opmask register k1 selects the active elements (partial vector or possibly non-contiguous if less than 16 active elements) from the source operand to compress into a contiguous vector. The contiguous vector is written to the destination starting from the low element of the destination operand.

Memory destination version: Only the contiguous vector is written to the destination memory location. EVEX.z must be zero.

Register destination version: If the vector length of the contiguous vector is less than that of the input vector in the source operand, the upper bits of the destination register are unmodified if EVEX.z is not set, otherwise the upper bits are zeroed.

Note: EVEX.vvvv is reserved and must be 1111b; otherwise, the instruction will #UD.

Note that the compressed displacement assumes a pre-scaling (N) corresponding to the size of one single element instead of the size of the full vector.
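The compress behavior described above can be sketched with a scalar C model; `compress_epi32` is an illustrative name (not an Intel API), and the zeroing form (EVEX.z = 1) of the register destination is assumed.

```c
#include <stdint.h>

/* Hypothetical scalar model of VPCOMPRESSD (register destination, zeroing):
   dword elements whose k1 bit is set are packed contiguously into dest
   starting at element 0; the remaining upper elements are zeroed. */
static void compress_epi32(uint32_t *dest, const uint32_t *src,
                           int kl, uint32_t k1) {
    int k = 0;
    for (int j = 0; j < kl; j++)
        if ((k1 >> j) & 1)
            dest[k++] = src[j];   /* next active element goes to slot k */
    while (k < kl)
        dest[k++] = 0;            /* zeroing-masking of the tail */
}
```

For the memory-destination form, only the first `popcount(k1)` elements would be written and the tail loop would be omitted.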


Operation

VPCOMPRESSD (EVEX encoded versions) store form
(KL, VL) = (4, 128), (8, 256), (16, 512)
SIZE ← 32
k ← 0
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no controlmask* THEN
        DEST[k+SIZE-1:k] ← SRC[i+31:i]
        k ← k + SIZE
    FI;
ENDFOR;



VPCOMPRESSD (EVEX encoded versions) reg-reg form
(KL, VL) = (4, 128), (8, 256), (16, 512)
SIZE ← 32
k ← 0
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no controlmask* THEN
        DEST[k+SIZE-1:k] ← SRC[i+31:i]
        k ← k + SIZE
    FI;
ENDFOR
IF *merging-masking*
    THEN *DEST[VL-1:k] remains unchanged*
    ELSE DEST[VL-1:k] ← 0
FI
DEST[MAXVL-1:VL] ← 0


Intel C/C++ Compiler Intrinsic Equivalent

VPCOMPRESSD __m512i _mm512_mask_compress_epi32( __m512i s, __mmask16 c, __m512i a);
VPCOMPRESSD __m512i _mm512_maskz_compress_epi32( __mmask16 c, __m512i a);
VPCOMPRESSD void _mm512_mask_compressstoreu_epi32(void * a, __mmask16 c, __m512i s);
VPCOMPRESSD __m256i _mm256_mask_compress_epi32( __m256i s, __mmask8 c, __m256i a);
VPCOMPRESSD __m256i _mm256_maskz_compress_epi32( __mmask8 c, __m256i a);
VPCOMPRESSD void _mm256_mask_compressstoreu_epi32(void * a, __mmask8 c, __m256i s);
VPCOMPRESSD __m128i _mm_mask_compress_epi32( __m128i s, __mmask8 c, __m128i a);
VPCOMPRESSD __m128i _mm_maskz_compress_epi32( __mmask8 c, __m128i a);
VPCOMPRESSD void _mm_mask_compressstoreu_epi32(void * a, __mmask8 c, __m128i s);


SIMD Floating-Point Exceptions

None


Other Exceptions

EVEX-encoded instruction, see Exceptions Type E4.nb.


VPCOMPRESSQ—Store Sparse Packed Quadword Integer Values into Dense Memory/Register

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F38.W1 8B /r VPCOMPRESSQ xmm1/m128 {k1}{z}, xmm2 | A | V/V | AVX512VL AVX512F | Compress packed quadword integer values from xmm2 to xmm1/m128 using controlmask k1.
EVEX.256.66.0F38.W1 8B /r VPCOMPRESSQ ymm1/m256 {k1}{z}, ymm2 | A | V/V | AVX512VL AVX512F | Compress packed quadword integer values from ymm2 to ymm1/m256 using controlmask k1.
EVEX.512.66.0F38.W1 8B /r VPCOMPRESSQ zmm1/m512 {k1}{z}, zmm2 | A | V/V | AVX512F | Compress packed quadword integer values from zmm2 to zmm1/m512 using controlmask k1.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | Tuple1 Scalar | ModRM:r/m (w) | ModRM:reg (r) | NA | NA

Description

Compress (store) up to 8/4/2 quadword integer values from the source operand (second operand) to the destination operand (first operand). The source operand is a ZMM/YMM/XMM register, the destination operand can be a ZMM/YMM/XMM register or a 512/256/128-bit memory location.

The opmask register k1 selects the active elements (partial vector or possibly non-contiguous if less than 8 active elements) from the source operand to compress into a contiguous vector. The contiguous vector is written to the destination starting from the low element of the destination operand.

Memory destination version: Only the contiguous vector is written to the destination memory location. EVEX.z must be zero.

Register destination version: If the vector length of the contiguous vector is less than that of the input vector in the source operand, the upper bits of the destination register are unmodified if EVEX.z is not set, otherwise the upper bits are zeroed.

Note: EVEX.vvvv is reserved and must be 1111b; otherwise, the instruction will #UD.

Note that the compressed displacement assumes a pre-scaling (N) corresponding to the size of one single element instead of the size of the full vector.


Operation

VPCOMPRESSQ (EVEX encoded versions) store form
(KL, VL) = (2, 128), (4, 256), (8, 512)
SIZE ← 64
k ← 0
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no controlmask* THEN
        DEST[k+SIZE-1:k] ← SRC[i+63:i]
        k ← k + SIZE
    FI;
ENDFOR



VPCOMPRESSQ (EVEX encoded versions) reg-reg form
(KL, VL) = (2, 128), (4, 256), (8, 512)
SIZE ← 64
k ← 0
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no controlmask* THEN
        DEST[k+SIZE-1:k] ← SRC[i+63:i]
        k ← k + SIZE
    FI;
ENDFOR
IF *merging-masking*
    THEN *DEST[VL-1:k] remains unchanged*
    ELSE DEST[VL-1:k] ← 0
FI
DEST[MAXVL-1:VL] ← 0


Intel C/C++ Compiler Intrinsic Equivalent

VPCOMPRESSQ __m512i _mm512_mask_compress_epi64( __m512i s, __mmask8 c, __m512i a);
VPCOMPRESSQ __m512i _mm512_maskz_compress_epi64( __mmask8 c, __m512i a);
VPCOMPRESSQ void _mm512_mask_compressstoreu_epi64(void * a, __mmask8 c, __m512i s);
VPCOMPRESSQ __m256i _mm256_mask_compress_epi64( __m256i s, __mmask8 c, __m256i a);
VPCOMPRESSQ __m256i _mm256_maskz_compress_epi64( __mmask8 c, __m256i a);
VPCOMPRESSQ void _mm256_mask_compressstoreu_epi64(void * a, __mmask8 c, __m256i s);
VPCOMPRESSQ __m128i _mm_mask_compress_epi64( __m128i s, __mmask8 c, __m128i a);
VPCOMPRESSQ __m128i _mm_maskz_compress_epi64( __mmask8 c, __m128i a);
VPCOMPRESSQ void _mm_mask_compressstoreu_epi64(void * a, __mmask8 c, __m128i s);


SIMD Floating-Point Exceptions

None


Other Exceptions

EVEX-encoded instruction, see Exceptions Type E4.nb.


VPCONFLICTD/Q—Detect Conflicts Within a Vector of Packed Dword/Qword Values into Dense Memory/Register

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F38.W0 C4 /r VPCONFLICTD xmm1 {k1}{z}, xmm2/m128/m32bcst | A | V/V | AVX512VL AVX512CD | Detect duplicate double-word values in xmm2/m128/m32bcst using writemask k1.
EVEX.256.66.0F38.W0 C4 /r VPCONFLICTD ymm1 {k1}{z}, ymm2/m256/m32bcst | A | V/V | AVX512VL AVX512CD | Detect duplicate double-word values in ymm2/m256/m32bcst using writemask k1.
EVEX.512.66.0F38.W0 C4 /r VPCONFLICTD zmm1 {k1}{z}, zmm2/m512/m32bcst | A | V/V | AVX512CD | Detect duplicate double-word values in zmm2/m512/m32bcst using writemask k1.
EVEX.128.66.0F38.W1 C4 /r VPCONFLICTQ xmm1 {k1}{z}, xmm2/m128/m64bcst | A | V/V | AVX512VL AVX512CD | Detect duplicate quad-word values in xmm2/m128/m64bcst using writemask k1.
EVEX.256.66.0F38.W1 C4 /r VPCONFLICTQ ymm1 {k1}{z}, ymm2/m256/m64bcst | A | V/V | AVX512VL AVX512CD | Detect duplicate quad-word values in ymm2/m256/m64bcst using writemask k1.
EVEX.512.66.0F38.W1 C4 /r VPCONFLICTQ zmm1 {k1}{z}, zmm2/m512/m64bcst | A | V/V | AVX512CD | Detect duplicate quad-word values in zmm2/m512/m64bcst using writemask k1.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | Full | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Description

Test each dword/qword element of the source operand (the second operand) for equality with all other elements in the source operand closer to the least significant element. Each element’s comparison results form a bit vector, which is then zero extended and written to the destination according to the writemask.

EVEX.512 encoded version: The source operand is a ZMM register, a 512-bit memory location, or a 512-bit vector broadcasted from a 32/64-bit memory location. The destination operand is a ZMM register, conditionally updated using writemask k1.

EVEX.256 encoded version: The source operand is a YMM register, a 256-bit memory location, or a 256-bit vector broadcasted from a 32/64-bit memory location. The destination operand is a YMM register, conditionally updated using writemask k1.

EVEX.128 encoded version: The source operand is a XMM register, a 128-bit memory location, or a 128-bit vector broadcasted from a 32/64-bit memory location. The destination operand is a XMM register, conditionally updated using writemask k1.

EVEX.vvvv is reserved and must be 1111b; otherwise, the instruction will #UD.
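The conflict-detection rule (each element compared against all lower-indexed elements) can be checked with a small scalar C model; `conflict_epi32` is an illustrative name, not an Intel API, and masking is ignored for clarity.

```c
#include <stdint.h>

/* Hypothetical scalar model of VPCONFLICTD: in result element j,
   bit k (k < j) is set iff src[j] equals src[k]; higher bits are zero. */
static void conflict_epi32(uint32_t *dest, const uint32_t *src, int kl) {
    for (int j = 0; j < kl; j++) {
        uint32_t bits = 0;
        for (int k = 0; k < j; k++)
            if (src[j] == src[k])
                bits |= 1u << k;    /* conflict with earlier element k */
        dest[j] = bits;
    }
}
```

Element 0 is always 0 since there are no lower-indexed elements; this property is what makes the instruction useful for detecting scatter/gather index collisions.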


Operation

VPCONFLICTD
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j*32
    IF MaskBit(j) OR *no writemask* THEN
        FOR k ← 0 TO j-1
            m ← k*32
            IF (SRC[i+31:i] = SRC[m+31:m])
                THEN DEST[i+k] ← 1
                ELSE DEST[i+k] ← 0
            FI
        ENDFOR
        DEST[i+31:i+j] ← 0
    ELSE
        IF *merging-masking*
            THEN *DEST[i+31:i] remains unchanged*
            ELSE DEST[i+31:i] ← 0
        FI
    FI
ENDFOR
DEST[MAXVL-1:VL] ← 0


VPCONFLICTQ
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j*64
    IF MaskBit(j) OR *no writemask* THEN
        FOR k ← 0 TO j-1
            m ← k*64
            IF (SRC[i+63:i] = SRC[m+63:m])
                THEN DEST[i+k] ← 1
                ELSE DEST[i+k] ← 0
            FI
        ENDFOR
        DEST[i+63:i+j] ← 0
    ELSE
        IF *merging-masking*
            THEN *DEST[i+63:i] remains unchanged*
            ELSE DEST[i+63:i] ← 0
        FI
    FI
ENDFOR
DEST[MAXVL-1:VL] ← 0



Intel C/C++ Compiler Intrinsic Equivalent

VPCONFLICTD __m512i _mm512_conflict_epi32( __m512i a);
VPCONFLICTD __m512i _mm512_mask_conflict_epi32( __m512i s, __mmask16 m, __m512i a);
VPCONFLICTD __m512i _mm512_maskz_conflict_epi32( __mmask16 m, __m512i a);
VPCONFLICTQ __m512i _mm512_conflict_epi64( __m512i a);
VPCONFLICTQ __m512i _mm512_mask_conflict_epi64( __m512i s, __mmask8 m, __m512i a);
VPCONFLICTQ __m512i _mm512_maskz_conflict_epi64( __mmask8 m, __m512i a);
VPCONFLICTD __m256i _mm256_conflict_epi32( __m256i a);
VPCONFLICTD __m256i _mm256_mask_conflict_epi32( __m256i s, __mmask8 m, __m256i a);
VPCONFLICTD __m256i _mm256_maskz_conflict_epi32( __mmask8 m, __m256i a);
VPCONFLICTQ __m256i _mm256_conflict_epi64( __m256i a);
VPCONFLICTQ __m256i _mm256_mask_conflict_epi64( __m256i s, __mmask8 m, __m256i a);
VPCONFLICTQ __m256i _mm256_maskz_conflict_epi64( __mmask8 m, __m256i a);
VPCONFLICTD __m128i _mm_conflict_epi32( __m128i a);
VPCONFLICTD __m128i _mm_mask_conflict_epi32( __m128i s, __mmask8 m, __m128i a);
VPCONFLICTD __m128i _mm_maskz_conflict_epi32( __mmask8 m, __m128i a);
VPCONFLICTQ __m128i _mm_conflict_epi64( __m128i a);
VPCONFLICTQ __m128i _mm_mask_conflict_epi64( __m128i s, __mmask8 m, __m128i a);
VPCONFLICTQ __m128i _mm_maskz_conflict_epi64( __mmask8 m, __m128i a);

SIMD Floating-Point Exceptions

None


Other Exceptions

EVEX-encoded instruction, see Exceptions Type E4NF.


VPERM2F128 — Permute Floating-Point Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
VEX.NDS.256.66.0F3A.W0 06 /r ib VPERM2F128 ymm1, ymm2, ymm3/m256, imm8 | RVMI | V/V | AVX | Permute 128-bit floating-point fields in ymm2 and ymm3/mem using controls from imm8 and store result in ymm1.


Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RVMI | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | imm8


Description

Permute 128 bit floating-point-containing fields from the first source operand (second operand) and second source operand (third operand) using bits in the 8-bit immediate and store results in the destination operand (first operand). The first source operand is a YMM register, the second source operand is a YMM register or a 256-bit memory location, and the destination operand is a YMM register.


[Figure: SRC2 holds 128-bit fields Y0 (low) and Y1 (high); SRC1 holds X0 (low) and X1 (high); each 128-bit field of DEST receives one of X0, X1, Y0, or Y1.]

Figure 5-21. VPERM2F128 Operation


Imm8[1:0] select the source for the first destination 128-bit field, imm8[5:4] select the source for the second destination field. If imm8[3] is set, the low 128-bit field is zeroed. If imm8[7] is set, the high 128-bit field is zeroed.

VEX.L must be 1, otherwise the instruction will #UD.
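The imm8 selection and zeroing rules above can be sketched in scalar C, treating each 128-bit field as two 64-bit halves; `perm2f128` is an illustrative name, not an Intel API.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of VPERM2F128: imm8[1:0] and imm8[5:4] each select
   one of {SRC1 low, SRC1 high, SRC2 low, SRC2 high}; imm8[3]/imm8[7]
   zero the low/high 128-bit destination field. */
static void perm2f128(uint64_t dest[4], const uint64_t src1[4],
                      const uint64_t src2[4], int imm8) {
    const uint64_t *fields[4] = { src1, src1 + 2, src2, src2 + 2 };
    memcpy(dest,     fields[imm8 & 3],        16);  /* DEST[127:0]   */
    memcpy(dest + 2, fields[(imm8 >> 4) & 3], 16);  /* DEST[255:128] */
    if (imm8 & 0x08) dest[0] = dest[1] = 0;         /* zero low field  */
    if (imm8 & 0x80) dest[2] = dest[3] = 0;         /* zero high field */
}
```

For example, imm8 = 0x31 selects the high field of SRC1 for the low destination field and the high field of SRC2 for the high destination field.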



Operation

VPERM2F128
CASE IMM8[1:0] OF
    0: DEST[127:0] ← SRC1[127:0]
    1: DEST[127:0] ← SRC1[255:128]
    2: DEST[127:0] ← SRC2[127:0]
    3: DEST[127:0] ← SRC2[255:128]
ESAC
CASE IMM8[5:4] OF
    0: DEST[255:128] ← SRC1[127:0]
    1: DEST[255:128] ← SRC1[255:128]
    2: DEST[255:128] ← SRC2[127:0]
    3: DEST[255:128] ← SRC2[255:128]
ESAC
IF (imm8[3]) DEST[127:0] ← 0 FI
IF (imm8[7]) DEST[MAXVL-1:128] ← 0 FI


Intel C/C++ Compiler Intrinsic Equivalent

VPERM2F128: __m256 _mm256_permute2f128_ps ( __m256 a, __m256 b, int control)
VPERM2F128: __m256d _mm256_permute2f128_pd ( __m256d a, __m256d b, int control)
VPERM2F128: __m256i _mm256_permute2f128_si256 ( __m256i a, __m256i b, int control)

SIMD Floating-Point Exceptions

None.


Other Exceptions

See Exceptions Type 6; additionally

#UD If VEX.L = 0.
If VEX.W = 1.


VPERM2I128 — Permute Integer Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
VEX.NDS.256.66.0F3A.W0 46 /r ib VPERM2I128 ymm1, ymm2, ymm3/m256, imm8 | RVMI | V/V | AVX2 | Permute 128-bit integer data in ymm2 and ymm3/mem using controls from imm8 and store result in ymm1.


Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RVMI | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | imm8


Description

Permute 128 bit integer data from the first source operand (second operand) and second source operand (third operand) using bits in the 8-bit immediate and store results in the destination operand (first operand). The first source operand is a YMM register, the second source operand is a YMM register or a 256-bit memory location, and the destination operand is a YMM register.


[Figure: SRC2 holds 128-bit fields Y0 (low) and Y1 (high); SRC1 holds X0 (low) and X1 (high); each 128-bit field of DEST receives one of X0, X1, Y0, or Y1.]

Figure 5-22. VPERM2I128 Operation


Imm8[1:0] select the source for the first destination 128-bit field, imm8[5:4] select the source for the second destination field. If imm8[3] is set, the low 128-bit field is zeroed. If imm8[7] is set, the high 128-bit field is zeroed.

VEX.L must be 1, otherwise the instruction will #UD.



Operation

VPERM2I128
CASE IMM8[1:0] OF
    0: DEST[127:0] ← SRC1[127:0]
    1: DEST[127:0] ← SRC1[255:128]
    2: DEST[127:0] ← SRC2[127:0]
    3: DEST[127:0] ← SRC2[255:128]
ESAC
CASE IMM8[5:4] OF
    0: DEST[255:128] ← SRC1[127:0]
    1: DEST[255:128] ← SRC1[255:128]
    2: DEST[255:128] ← SRC2[127:0]
    3: DEST[255:128] ← SRC2[255:128]
ESAC
IF (imm8[3]) DEST[127:0] ← 0 FI
IF (imm8[7]) DEST[255:128] ← 0 FI


Intel C/C++ Compiler Intrinsic Equivalent

VPERM2I128: m256i _mm256_permute2x128_si256 ( m256i a, m256i b, int control)


SIMD Floating-Point Exceptions

None


Other Exceptions

See Exceptions Type 6; additionally

#UD If VEX.L = 0.
If VEX.W = 1.


VPERMB—Permute Packed Bytes Elements

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.NDS.128.66.0F38.W0 8D /r VPERMB xmm1 {k1}{z}, xmm2, xmm3/m128 | A | V/V | AVX512VL AVX512_VBMI | Permute bytes in xmm3/m128 using byte indexes in xmm2 and store the result in xmm1 using writemask k1.
EVEX.NDS.256.66.0F38.W0 8D /r VPERMB ymm1 {k1}{z}, ymm2, ymm3/m256 | A | V/V | AVX512VL AVX512_VBMI | Permute bytes in ymm3/m256 using byte indexes in ymm2 and store the result in ymm1 using writemask k1.
EVEX.NDS.512.66.0F38.W0 8D /r VPERMB zmm1 {k1}{z}, zmm2, zmm3/m512 | A | V/V | AVX512_VBMI | Permute bytes in zmm3/m512 using byte indexes in zmm2 and store the result in zmm1 using writemask k1.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | Full Mem | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA

Description

Copies bytes from the second source operand (the third operand) to the destination operand (the first operand) according to the byte indices in the first source operand (the second operand). Note that this instruction permits a byte in the source operand to be copied to more than one location in the destination operand.

Only the low 6 (EVEX.512)/5 (EVEX.256)/4 (EVEX.128) bits of each byte index are used to select the location of the source byte from the second source operand.

The first source operand is a ZMM/YMM/XMM register. The second source operand can be a ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand is a ZMM/YMM/XMM register updated at byte granularity by the writemask k1.
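The index-truncation behavior is easy to model in scalar C for the 128-bit form, where only the low 4 bits of each index byte are used; `permb128` is an illustrative name, not an Intel API, and masking is omitted.

```c
#include <stdint.h>

/* Hypothetical scalar model of VPERMB at VL = 128: each of the 16
   destination bytes is the source byte selected by the low 4 bits
   of the corresponding index byte (higher index bits are ignored). */
static void permb128(uint8_t *dest, const uint8_t *idx, const uint8_t *src) {
    for (int j = 0; j < 16; j++)
        dest[j] = src[idx[j] & 0x0F];
}
```

Because indices are truncated rather than range-checked, any index value is legal, and the same source byte may be replicated to many destination positions.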


Operation

VPERMB (EVEX encoded versions)
(KL, VL) = (16, 128), (32, 256), (64, 512)
IF VL = 128:
    n ← 3;
ELSE IF VL = 256:
    n ← 4;
ELSE IF VL = 512:
    n ← 5;
FI;
FOR j ← 0 TO KL-1:
    id ← SRC1[j*8 + n : j*8] ; location of the source byte
    IF k1[j] OR *no writemask* THEN
        DEST[j*8 + 7: j*8] ← SRC2[id*8 + 7: id*8];
    ELSE IF *zeroing-masking* THEN
        DEST[j*8 + 7: j*8] ← 0;
    ELSE
        *DEST[j*8 + 7: j*8] remains unchanged*
    FI
ENDFOR
DEST[MAX_VL-1:VL] ← 0;


Intel C/C++ Compiler Intrinsic Equivalent

VPERMB __m512i _mm512_permutexvar_epi8( __m512i idx, __m512i a);
VPERMB __m512i _mm512_mask_permutexvar_epi8( __m512i s, __mmask64 k, __m512i idx, __m512i a);
VPERMB __m512i _mm512_maskz_permutexvar_epi8( __mmask64 k, __m512i idx, __m512i a);
VPERMB __m256i _mm256_permutexvar_epi8( __m256i idx, __m256i a);
VPERMB __m256i _mm256_mask_permutexvar_epi8( __m256i s, __mmask32 k, __m256i idx, __m256i a);
VPERMB __m256i _mm256_maskz_permutexvar_epi8( __mmask32 k, __m256i idx, __m256i a);
VPERMB __m128i _mm_permutexvar_epi8( __m128i idx, __m128i a);
VPERMB __m128i _mm_mask_permutexvar_epi8( __m128i s, __mmask16 k, __m128i idx, __m128i a);
VPERMB __m128i _mm_maskz_permutexvar_epi8( __mmask16 k, __m128i idx, __m128i a);


SIMD Floating-Point Exceptions

None.


Other Exceptions

See Exceptions Type E4NF.nb.


VPERMD/VPERMW—Permute Packed Doublewords/Words Elements

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
VEX.NDS.256.66.0F38.W0 36 /r VPERMD ymm1, ymm2, ymm3/m256 | A | V/V | AVX2 | Permute doublewords in ymm3/m256 using indices in ymm2 and store the result in ymm1.
EVEX.NDS.256.66.0F38.W0 36 /r VPERMD ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | B | V/V | AVX512VL AVX512F | Permute doublewords in ymm3/m256/m32bcst using indexes in ymm2 and store the result in ymm1 using writemask k1.
EVEX.NDS.512.66.0F38.W0 36 /r VPERMD zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst | B | V/V | AVX512F | Permute doublewords in zmm3/m512/m32bcst using indices in zmm2 and store the result in zmm1 using writemask k1.
EVEX.NDS.128.66.0F38.W1 8D /r VPERMW xmm1 {k1}{z}, xmm2, xmm3/m128 | C | V/V | AVX512VL AVX512BW | Permute word integers in xmm3/m128 using indexes in xmm2 and store the result in xmm1 using writemask k1.
EVEX.NDS.256.66.0F38.W1 8D /r VPERMW ymm1 {k1}{z}, ymm2, ymm3/m256 | C | V/V | AVX512VL AVX512BW | Permute word integers in ymm3/m256 using indexes in ymm2 and store the result in ymm1 using writemask k1.
EVEX.NDS.512.66.0F38.W1 8D /r VPERMW zmm1 {k1}{z}, zmm2, zmm3/m512 | C | V/V | AVX512BW | Permute word integers in zmm3/m512 using indexes in zmm2 and store the result in zmm1 using writemask k1.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | NA | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
B | Full | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA
C | Full Mem | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA

Description

Copies doublewords (or words) from the second source operand (the third operand) to the destination operand (the first operand) according to the indices in the first source operand (the second operand). Note that this instruction permits a doubleword (word) in the source operand to be copied to more than one location in the destination operand.

VEX.256 encoded VPERMD: The first and second operands are YMM registers, the third operand can be a YMM register or memory location. Bits (MAXVL-1:256) of the corresponding destination register are zeroed.

EVEX encoded VPERMD: The first and second operands are ZMM/YMM registers, the third operand can be a ZMM/YMM register, a 512/256-bit memory location or a 512/256-bit vector broadcasted from a 32-bit memory location. The elements in the destination are updated using the writemask k1.

VPERMW: first and second operands are ZMM/YMM/XMM registers, the third operand can be a ZMM/YMM/XMM register, or a 512/256/128-bit memory location. The destination is updated using the writemask k1.

EVEX.128 encoded versions: Bits (MAXVL-1:128) of the corresponding ZMM register are zeroed.
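The VEX.256 VPERMD selection rule (low 3 bits of each index dword pick one of eight source dwords) can be sketched in scalar C; `permd256` is an illustrative name, not an Intel API.

```c
#include <stdint.h>

/* Hypothetical scalar model of VEX.256 VPERMD: each result dword is the
   source dword selected by the low 3 bits of the corresponding index. */
static void permd256(uint32_t dest[8], const uint32_t idx[8],
                     const uint32_t src[8]) {
    for (int j = 0; j < 8; j++)
        dest[j] = src[idx[j] & 7];   /* bits above [2:0] are ignored */
}
```

This mirrors the shift-based formulation in the Operation section: `(SRC2 >> (idx * 32))[31:0]` is exactly "select dword number idx".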



Operation

VPERMD (EVEX encoded versions)
(KL, VL) = (8, 256), (16, 512)
IF VL = 256 THEN n ← 2; FI;
IF VL = 512 THEN n ← 3; FI;
FOR j ← 0 TO KL-1
    i ← j * 32
    id ← 32*SRC1[i+n:i]
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b = 1) AND (SRC2 *is memory*)
            THEN DEST[i+31:i] ← SRC2[31:0];
            ELSE DEST[i+31:i] ← SRC2[id+31:id];
        FI;
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+31:i] remains unchanged*
            ELSE DEST[i+31:i] ← 0 ; zeroing-masking
        FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


VPERMD (VEX.256 encoded version)
DEST[31:0] ← (SRC2[255:0] >> (SRC1[2:0] * 32))[31:0];
DEST[63:32] ← (SRC2[255:0] >> (SRC1[34:32] * 32))[31:0];
DEST[95:64] ← (SRC2[255:0] >> (SRC1[66:64] * 32))[31:0];
DEST[127:96] ← (SRC2[255:0] >> (SRC1[98:96] * 32))[31:0];
DEST[159:128] ← (SRC2[255:0] >> (SRC1[130:128] * 32))[31:0];
DEST[191:160] ← (SRC2[255:0] >> (SRC1[162:160] * 32))[31:0];
DEST[223:192] ← (SRC2[255:0] >> (SRC1[194:192] * 32))[31:0];
DEST[255:224] ← (SRC2[255:0] >> (SRC1[226:224] * 32))[31:0];
DEST[MAXVL-1:256] ← 0


VPERMW (EVEX encoded versions)
(KL, VL) = (8, 128), (16, 256), (32, 512)
IF VL = 128 THEN n ← 2; FI;
IF VL = 256 THEN n ← 3; FI;
IF VL = 512 THEN n ← 4; FI;
FOR j ← 0 TO KL-1
    i ← j * 16
    id ← 16*SRC1[i+n:i]
    IF k1[j] OR *no writemask*
        THEN DEST[i+15:i] ← SRC2[id+15:id]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+15:i] remains unchanged*
                ELSE DEST[i+15:i] ← 0 ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0





Intel C/C++ Compiler Intrinsic Equivalent

VPERMD __m512i _mm512_permutexvar_epi32( __m512i idx, __m512i a);
VPERMD __m512i _mm512_mask_permutexvar_epi32( __m512i s, __mmask16 k, __m512i idx, __m512i a);
VPERMD __m512i _mm512_maskz_permutexvar_epi32( __mmask16 k, __m512i idx, __m512i a);
VPERMD __m256i _mm256_permutexvar_epi32( __m256i idx, __m256i a);
VPERMD __m256i _mm256_mask_permutexvar_epi32( __m256i s, __mmask8 k, __m256i idx, __m256i a);
VPERMD __m256i _mm256_maskz_permutexvar_epi32( __mmask8 k, __m256i idx, __m256i a);
VPERMW __m512i _mm512_permutexvar_epi16( __m512i idx, __m512i a);
VPERMW __m512i _mm512_mask_permutexvar_epi16( __m512i s, __mmask32 k, __m512i idx, __m512i a);
VPERMW __m512i _mm512_maskz_permutexvar_epi16( __mmask32 k, __m512i idx, __m512i a);
VPERMW __m256i _mm256_permutexvar_epi16( __m256i idx, __m256i a);
VPERMW __m256i _mm256_mask_permutexvar_epi16( __m256i s, __mmask16 k, __m256i idx, __m256i a);
VPERMW __m256i _mm256_maskz_permutexvar_epi16( __mmask16 k, __m256i idx, __m256i a);
VPERMW __m128i _mm_permutexvar_epi16( __m128i idx, __m128i a);
VPERMW __m128i _mm_mask_permutexvar_epi16( __m128i s, __mmask8 k, __m128i idx, __m128i a);
VPERMW __m128i _mm_maskz_permutexvar_epi16( __mmask8 k, __m128i idx, __m128i a);


SIMD Floating-Point Exceptions

None


Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 4.
EVEX-encoded VPERMD, see Exceptions Type E4NF.
EVEX-encoded VPERMW, see Exceptions Type E4NF.nb.
#UD If VEX.L = 0.
If EVEX.L’L = 0 for VPERMD.


VPERMI2B—Full Permute of Bytes from Two Tables Overwriting the Index

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.DDS.128.66.0F38.W0 75 /r VPERMI2B xmm1 {k1}{z}, xmm2, xmm3/m128 | A | V/V | AVX512VL AVX512_VBMI | Permute bytes in xmm3/m128 and xmm2 using byte indexes in xmm1 and store the byte results in xmm1 using writemask k1.
EVEX.DDS.256.66.0F38.W0 75 /r VPERMI2B ymm1 {k1}{z}, ymm2, ymm3/m256 | A | V/V | AVX512VL AVX512_VBMI | Permute bytes in ymm3/m256 and ymm2 using byte indexes in ymm1 and store the byte results in ymm1 using writemask k1.
EVEX.DDS.512.66.0F38.W0 75 /r VPERMI2B zmm1 {k1}{z}, zmm2, zmm3/m512 | A | V/V | AVX512_VBMI | Permute bytes in zmm3/m512 and zmm2 using byte indexes in zmm1 and store the byte results in zmm1 using writemask k1.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | Full Mem | ModRM:reg (r, w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA

Description

Permutes byte values in the second operand (the first source operand) and the third operand (the second source operand) using the byte indices in the first operand (the destination operand) to select byte elements from the second or third source operands. The selected byte elements are written to the destination at byte granularity under the writemask k1.

The first and second operands are ZMM/YMM/XMM registers. The first operand contains input indices to select elements from the two input tables in the 2nd and 3rd operands. The first operand is also the destination of the result. The third operand can be a ZMM/YMM/XMM register, or a 512/256/128-bit memory location. In each index byte, the id bit for table selection is bit 6/5/4, and bits [5:0]/[4:0]/[3:0] selects element within each input table.

Note that these instructions permit a byte value in the source operands to be copied to more than one location in the destination operand. Also, the same tables can be reused in subsequent iterations, but the index elements are overwritten.

Bits (MAX_VL-1:256/128) of the destination are zeroed for VL=256,128.
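The two-table lookup with index overwrite can be sketched in scalar C for the 128-bit form, where bit 4 of each index byte selects the table and bits [3:0] select the element; `permi2b128` is an illustrative name, not an Intel API, and masking is omitted.

```c
#include <stdint.h>

/* Hypothetical scalar model of VPERMI2B at VL = 128: idx_dest holds the
   byte indices on entry (the destination register) and is overwritten
   with the selected bytes. a models the first source (table 0), b the
   second source (table 1). */
static void permi2b128(uint8_t *idx_dest, const uint8_t *a, const uint8_t *b) {
    for (int j = 0; j < 16; j++) {
        uint8_t id = idx_dest[j];
        const uint8_t *table = (id & 0x10) ? b : a;  /* bit 4: table select */
        idx_dest[j] = table[id & 0x0F];              /* bits [3:0]: element */
    }
}
```

Because the indices are destroyed, a caller that wants to reuse the index vector across iterations would keep a separate copy (or use the non-overwriting VPERMT2B form).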



Operation

VPERMI2B (EVEX encoded versions)
(KL, VL) = (16, 128), (32, 256), (64, 512)
IF VL = 128:
    id ← 3;
ELSE IF VL = 256:
    id ← 4;
ELSE IF VL = 512:
    id ← 5;
FI;
TMP_DEST[VL-1:0] ← DEST[VL-1:0];
FOR j ← 0 TO KL-1
    off ← 8*SRC1[j*8 + id: j*8];
    IF k1[j] OR *no writemask* THEN
        DEST[j*8 + 7: j*8] ← TMP_DEST[j*8+id+1] ? SRC2[off+7:off] : SRC1[off+7:off];
    ELSE IF *zeroing-masking* THEN
        DEST[j*8 + 7: j*8] ← 0;
    ELSE
        *DEST[j*8 + 7: j*8] remains unchanged*
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0;


Intel C/C++ Compiler Intrinsic Equivalent

VPERMI2B __m512i _mm512_permutex2var_epi8(__m512i a, __m512i idx, __m512i b);
VPERMI2B __m512i _mm512_mask2_permutex2var_epi8(__m512i a, __m512i idx, __mmask64 k, __m512i b);
VPERMI2B __m512i _mm512_maskz_permutex2var_epi8(__mmask64 k, __m512i a, __m512i idx, __m512i b);
VPERMI2B __m256i _mm256_permutex2var_epi8(__m256i a, __m256i idx, __m256i b);
VPERMI2B __m256i _mm256_mask2_permutex2var_epi8(__m256i a, __m256i idx, __mmask32 k, __m256i b);
VPERMI2B __m256i _mm256_maskz_permutex2var_epi8(__mmask32 k, __m256i a, __m256i idx, __m256i b);
VPERMI2B __m128i _mm_permutex2var_epi8(__m128i a, __m128i idx, __m128i b);
VPERMI2B __m128i _mm_mask2_permutex2var_epi8(__m128i a, __m128i idx, __mmask16 k, __m128i b);
VPERMI2B __m128i _mm_maskz_permutex2var_epi8(__mmask16 k, __m128i a, __m128i idx, __m128i b);


SIMD Floating-Point Exceptions

None.


Other Exceptions

See Exceptions Type E4NF.nb.


VPERMI2W/D/Q/PS/PD—Full Permute From Two Tables Overwriting the Index

Opcode/ Instruction

Op / En

64/32 bit Mode Support

CPUID

Feature Flag

Description

EVEX.DDS.128.66.0F38.W1 75 /r

VPERMI2W xmm1 {k1}{z}, xmm2, xmm3/m128

A

V/V

AVX512VL AVX512BW

Permute word integers from two tables in xmm3/m128 and xmm2 using indexes in xmm1 and store the result in xmm1 using writemask k1.

EVEX.DDS.256.66.0F38.W1 75 /r

VPERMI2W ymm1 {k1}{z}, ymm2, ymm3/m256

A

V/V

AVX512VL AVX512BW

Permute word integers from two tables in ymm3/m256 and ymm2 using indexes in ymm1 and store the result in ymm1 using writemask k1.

EVEX.DDS.512.66.0F38.W1 75 /r

VPERMI2W zmm1 {k1}{z}, zmm2, zmm3/m512

A

V/V

AVX512BW

Permute word integers from two tables in zmm3/m512 and zmm2 using indexes in zmm1 and store the result in zmm1 using writemask k1.

EVEX.DDS.128.66.0F38.W0 76 /r

VPERMI2D xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst

B

V/V

AVX512VL AVX512F

Permute double-words from two tables in xmm3/m128/m32bcst and xmm2 using indexes in xmm1 and store the result in xmm1 using writemask k1.

EVEX.DDS.256.66.0F38.W0 76 /r

VPERMI2D ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst

B

V/V

AVX512VL AVX512F

Permute double-words from two tables in ymm3/m256/m32bcst and ymm2 using indexes in ymm1 and store the result in ymm1 using writemask k1.

EVEX.DDS.512.66.0F38.W0 76 /r

VPERMI2D zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst

B

V/V

AVX512F

Permute double-words from two tables in zmm3/m512/m32bcst and zmm2 using indices in zmm1 and store the result in zmm1 using writemask k1.

EVEX.DDS.128.66.0F38.W1 76 /r

VPERMI2Q xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst

B

V/V

AVX512VL AVX512F

Permute quad-words from two tables in xmm3/m128/m64bcst and xmm2 using indexes in xmm1 and store the result in xmm1 using writemask k1.

EVEX.DDS.256.66.0F38.W1 76 /r

VPERMI2Q ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst

B

V/V

AVX512VL AVX512F

Permute quad-words from two tables in ymm3/m256/m64bcst and ymm2 using indexes in ymm1 and store the result in ymm1 using writemask k1.

EVEX.DDS.512.66.0F38.W1 76 /r

VPERMI2Q zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst

B

V/V

AVX512F

Permute quad-words from two tables in zmm3/m512/m64bcst and zmm2 using indices in zmm1 and store the result in zmm1 using writemask k1.

EVEX.DDS.128.66.0F38.W0 77 /r

VPERMI2PS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst

B

V/V

AVX512VL AVX512F

Permute single-precision FP values from two tables in xmm3/m128/m32bcst and xmm2 using indexes in xmm1 and store the result in xmm1 using writemask k1.

EVEX.DDS.256.66.0F38.W0 77 /r

VPERMI2PS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst

B

V/V

AVX512VL AVX512F

Permute single-precision FP values from two tables in ymm3/m256/m32bcst and ymm2 using indexes in ymm1 and store the result in ymm1 using writemask k1.

EVEX.DDS.512.66.0F38.W0 77 /r

VPERMI2PS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst

B

V/V

AVX512F

Permute single-precision FP values from two tables in zmm3/m512/m32bcst and zmm2 using indices in zmm1 and store the result in zmm1 using writemask k1.


Opcode/ Instruction

Op / En

64/32 bit Mode Support

CPUID

Feature Flag

Description

EVEX.DDS.128.66.0F38.W1 77 /r

VPERMI2PD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst

B

V/V

AVX512VL AVX512F

Permute double-precision FP values from two tables in xmm3/m128/m64bcst and xmm2 using indexes in xmm1 and store the result in xmm1 using writemask k1.

EVEX.DDS.256.66.0F38.W1 77 /r

VPERMI2PD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst

B

V/V

AVX512VL AVX512F

Permute double-precision FP values from two tables in ymm3/m256/m64bcst and ymm2 using indexes in ymm1 and store the result in ymm1 using writemask k1.

EVEX.DDS.512.66.0F38.W1 77 /r

VPERMI2PD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst

B

V/V

AVX512F

Permute double-precision FP values from two tables in zmm3/m512/m64bcst and zmm2 using indices in zmm1 and store the result in zmm1 using writemask k1.


Instruction Operand Encoding

Op/En

Tuple Type

Operand 1

Operand 2

Operand 3

Operand 4

A

Full Mem

ModRM:reg (r,w)

EVEX.vvvv (r)

ModRM:r/m (r)

NA

B

Full

ModRM:reg (r, w)

EVEX.vvvv (r)

ModRM:r/m (r)

NA

Description

Permutes 16-bit/32-bit/64-bit values in the second operand (the first source operand) and the third operand (the second source operand) using indices in the first operand to select elements from the second and third operands. The selected elements are written to the destination operand (the first operand) according to the writemask k1.

The first and second operands are ZMM/YMM/XMM registers. The first operand contains input indices to select elements from the two input tables in the 2nd and 3rd operands. The first operand is also the destination of the result.

D/Q/PS/PD element versions: The second source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 32/64-bit memory location. Broadcast from the low 32/64-bit memory location is performed if EVEX.b and the id bit for table selection are set (selecting table_2).

Dword/PS versions: The id bit for table selection is bit 4/3/2, depending on VL=512, 256, 128. Bits [3:0]/[2:0]/[1:0] of each element in the input index vector select an element within the two source operands. If the id bit is 0, table_1 (the first source) is selected; otherwise table_2 (the second source) is selected.

Qword/PD versions: The id bit for table selection is bit 3/2/1, and bits [2:0]/[1:0]/bit 0 select the element within each input table.

Word element versions: The second source operand can be a ZMM/YMM/XMM register, or a 512/256/128-bit memory location. The id bit for table selection is bit 5/4/3, and bits [4:0]/[3:0]/[2:0] select the element within each input table.

Note that these instructions permit a 16-bit/32-bit/64-bit value in the source operands to be copied to more than one location in the destination operand. Note also that the same tables can be reused in subsequent iterations, while the index elements are overwritten.

Bits (MAXVL-1:256/128) of the destination are zeroed for VL=256,128.
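The dword index decoding described above can be sketched as a scalar reference model in C for VL = 512. This is an illustrative sketch, not Intel code: vpermi2d_model and its parameter names are ours, and the writemask is omitted.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar sketch of VPERMI2D at VL = 512 (16 dword elements per table).
 * Per the description, the id bit is bit 4 of each index dword and bits
 * [3:0] select the element; the result replaces the indices. */
static void vpermi2d_model(uint32_t idx[16],
                           const uint32_t tbl1[16],   /* first source (second operand) */
                           const uint32_t tbl2[16])   /* second source (third operand) */
{
    for (int j = 0; j < 16; j++) {
        int off = idx[j] & 0xF;          /* bits [3:0]: element within a table */
        int id  = (idx[j] >> 4) & 1;     /* bit 4: 0 -> tbl1, 1 -> tbl2 */
        idx[j] = id ? tbl2[off] : tbl1[off];
    }
}
```

An index of 0x12, for example, selects element 2 of the second table because bit 4 is set.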



Operation

VPERMI2W (EVEX encoded versions)
(KL, VL) = (8, 128), (16, 256), (32, 512)
IF VL = 128
    id ← 2
FI;
IF VL = 256
    id ← 3
FI;
IF VL = 512
    id ← 4
FI;
TMP_DEST ← DEST
FOR j ← 0 TO KL-1
    i ← j * 16
    off ← 16*TMP_DEST[i+id:i]
    IF k1[j] OR *no writemask*
        THEN
            DEST[i+15:i] ← TMP_DEST[i+id+1] ? SRC2[off+15:off]
                                            : SRC1[off+15:off]
        ELSE
            IF *merging-masking*            ; merging-masking
                THEN *DEST[i+15:i] remains unchanged*
                ELSE                        ; zeroing-masking
                    DEST[i+15:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


VPERMI2D/VPERMI2PS (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF VL = 128
    id ← 1
FI;
IF VL = 256
    id ← 2
FI;
IF VL = 512
    id ← 3
FI;
TMP_DEST ← DEST
FOR j ← 0 TO KL-1
    i ← j * 32
    off ← 32*TMP_DEST[i+id:i]
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC2 *is memory*)
                THEN
                    DEST[i+31:i] ← TMP_DEST[i+id+1] ? SRC2[31:0]
                                                    : SRC1[off+31:off]
                ELSE
                    DEST[i+31:i] ← TMP_DEST[i+id+1] ? SRC2[off+31:off]
                                                    : SRC1[off+31:off]
            FI
        ELSE
            IF *merging-masking*            ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE                        ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


VPERMI2Q/VPERMI2PD (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF VL = 128
    id ← 0
FI;
IF VL = 256
    id ← 1
FI;
IF VL = 512
    id ← 2
FI;
TMP_DEST ← DEST
FOR j ← 0 TO KL-1
    i ← j * 64
    off ← 64*TMP_DEST[i+id:i]
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC2 *is memory*)
                THEN
                    DEST[i+63:i] ← TMP_DEST[i+id+1] ? SRC2[63:0]
                                                    : SRC1[off+63:off]
                ELSE
                    DEST[i+63:i] ← TMP_DEST[i+id+1] ? SRC2[off+63:off]
                                                    : SRC1[off+63:off]
            FI
        ELSE
            IF *merging-masking*            ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE                        ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0



Intel C/C++ Compiler Intrinsic Equivalent

VPERMI2D __m512i _mm512_permutex2var_epi32(__m512i a, __m512i idx, __m512i b);
VPERMI2D __m512i _mm512_mask_permutex2var_epi32(__m512i a, __mmask16 k, __m512i idx, __m512i b);
VPERMI2D __m512i _mm512_mask2_permutex2var_epi32(__m512i a, __m512i idx, __mmask16 k, __m512i b);
VPERMI2D __m512i _mm512_maskz_permutex2var_epi32(__mmask16 k, __m512i a, __m512i idx, __m512i b);
VPERMI2D __m256i _mm256_permutex2var_epi32(__m256i a, __m256i idx, __m256i b);
VPERMI2D __m256i _mm256_mask_permutex2var_epi32(__m256i a, __mmask8 k, __m256i idx, __m256i b);
VPERMI2D __m256i _mm256_mask2_permutex2var_epi32(__m256i a, __m256i idx, __mmask8 k, __m256i b);
VPERMI2D __m256i _mm256_maskz_permutex2var_epi32(__mmask8 k, __m256i a, __m256i idx, __m256i b);
VPERMI2D __m128i _mm_permutex2var_epi32(__m128i a, __m128i idx, __m128i b);
VPERMI2D __m128i _mm_mask_permutex2var_epi32(__m128i a, __mmask8 k, __m128i idx, __m128i b);
VPERMI2D __m128i _mm_mask2_permutex2var_epi32(__m128i a, __m128i idx, __mmask8 k, __m128i b);
VPERMI2D __m128i _mm_maskz_permutex2var_epi32(__mmask8 k, __m128i a, __m128i idx, __m128i b);
VPERMI2PD __m512d _mm512_permutex2var_pd(__m512d a, __m512i idx, __m512d b);
VPERMI2PD __m512d _mm512_mask_permutex2var_pd(__m512d a, __mmask8 k, __m512i idx, __m512d b);
VPERMI2PD __m512d _mm512_mask2_permutex2var_pd(__m512d a, __m512i idx, __mmask8 k, __m512d b);
VPERMI2PD __m512d _mm512_maskz_permutex2var_pd(__mmask8 k, __m512d a, __m512i idx, __m512d b);
VPERMI2PD __m256d _mm256_permutex2var_pd(__m256d a, __m256i idx, __m256d b);
VPERMI2PD __m256d _mm256_mask_permutex2var_pd(__m256d a, __mmask8 k, __m256i idx, __m256d b);
VPERMI2PD __m256d _mm256_mask2_permutex2var_pd(__m256d a, __m256i idx, __mmask8 k, __m256d b);
VPERMI2PD __m256d _mm256_maskz_permutex2var_pd(__mmask8 k, __m256d a, __m256i idx, __m256d b);
VPERMI2PD __m128d _mm_permutex2var_pd(__m128d a, __m128i idx, __m128d b);
VPERMI2PD __m128d _mm_mask_permutex2var_pd(__m128d a, __mmask8 k, __m128i idx, __m128d b);
VPERMI2PD __m128d _mm_mask2_permutex2var_pd(__m128d a, __m128i idx, __mmask8 k, __m128d b);
VPERMI2PD __m128d _mm_maskz_permutex2var_pd(__mmask8 k, __m128d a, __m128i idx, __m128d b);
VPERMI2PS __m512 _mm512_permutex2var_ps(__m512 a, __m512i idx, __m512 b);
VPERMI2PS __m512 _mm512_mask_permutex2var_ps(__m512 a, __mmask16 k, __m512i idx, __m512 b);
VPERMI2PS __m512 _mm512_mask2_permutex2var_ps(__m512 a, __m512i idx, __mmask16 k, __m512 b);
VPERMI2PS __m512 _mm512_maskz_permutex2var_ps(__mmask16 k, __m512 a, __m512i idx, __m512 b);
VPERMI2PS __m256 _mm256_permutex2var_ps(__m256 a, __m256i idx, __m256 b);
VPERMI2PS __m256 _mm256_mask_permutex2var_ps(__m256 a, __mmask8 k, __m256i idx, __m256 b);
VPERMI2PS __m256 _mm256_mask2_permutex2var_ps(__m256 a, __m256i idx, __mmask8 k, __m256 b);
VPERMI2PS __m256 _mm256_maskz_permutex2var_ps(__mmask8 k, __m256 a, __m256i idx, __m256 b);
VPERMI2PS __m128 _mm_permutex2var_ps(__m128 a, __m128i idx, __m128 b);
VPERMI2PS __m128 _mm_mask_permutex2var_ps(__m128 a, __mmask8 k, __m128i idx, __m128 b);
VPERMI2PS __m128 _mm_mask2_permutex2var_ps(__m128 a, __m128i idx, __mmask8 k, __m128 b);
VPERMI2PS __m128 _mm_maskz_permutex2var_ps(__mmask8 k, __m128 a, __m128i idx, __m128 b);
VPERMI2Q __m512i _mm512_permutex2var_epi64(__m512i a, __m512i idx, __m512i b);
VPERMI2Q __m512i _mm512_mask_permutex2var_epi64(__m512i a, __mmask8 k, __m512i idx, __m512i b);
VPERMI2Q __m512i _mm512_mask2_permutex2var_epi64(__m512i a, __m512i idx, __mmask8 k, __m512i b);
VPERMI2Q __m512i _mm512_maskz_permutex2var_epi64(__mmask8 k, __m512i a, __m512i idx, __m512i b);
VPERMI2Q __m256i _mm256_permutex2var_epi64(__m256i a, __m256i idx, __m256i b);
VPERMI2Q __m256i _mm256_mask_permutex2var_epi64(__m256i a, __mmask8 k, __m256i idx, __m256i b);
VPERMI2Q __m256i _mm256_mask2_permutex2var_epi64(__m256i a, __m256i idx, __mmask8 k, __m256i b);
VPERMI2Q __m256i _mm256_maskz_permutex2var_epi64(__mmask8 k, __m256i a, __m256i idx, __m256i b);
VPERMI2Q __m128i _mm_permutex2var_epi64(__m128i a, __m128i idx, __m128i b);
VPERMI2Q __m128i _mm_mask_permutex2var_epi64(__m128i a, __mmask8 k, __m128i idx, __m128i b);
VPERMI2Q __m128i _mm_mask2_permutex2var_epi64(__m128i a, __m128i idx, __mmask8 k, __m128i b);
VPERMI2Q __m128i _mm_maskz_permutex2var_epi64(__mmask8 k, __m128i a, __m128i idx, __m128i b);
VPERMI2W __m512i _mm512_permutex2var_epi16(__m512i a, __m512i idx, __m512i b);
VPERMI2W __m512i _mm512_mask_permutex2var_epi16(__m512i a, __mmask32 k, __m512i idx, __m512i b);
VPERMI2W __m512i _mm512_mask2_permutex2var_epi16(__m512i a, __m512i idx, __mmask32 k, __m512i b);
VPERMI2W __m512i _mm512_maskz_permutex2var_epi16(__mmask32 k, __m512i a, __m512i idx, __m512i b);
VPERMI2W __m256i _mm256_permutex2var_epi16(__m256i a, __m256i idx, __m256i b);
VPERMI2W __m256i _mm256_mask_permutex2var_epi16(__m256i a, __mmask16 k, __m256i idx, __m256i b);
VPERMI2W __m256i _mm256_mask2_permutex2var_epi16(__m256i a, __m256i idx, __mmask16 k, __m256i b);
VPERMI2W __m256i _mm256_maskz_permutex2var_epi16(__mmask16 k, __m256i a, __m256i idx, __m256i b);
VPERMI2W __m128i _mm_permutex2var_epi16(__m128i a, __m128i idx, __m128i b);
VPERMI2W __m128i _mm_mask_permutex2var_epi16(__m128i a, __mmask8 k, __m128i idx, __m128i b);
VPERMI2W __m128i _mm_mask2_permutex2var_epi16(__m128i a, __m128i idx, __mmask8 k, __m128i b);
VPERMI2W __m128i _mm_maskz_permutex2var_epi16(__mmask8 k, __m128i a, __m128i idx, __m128i b);

SIMD Floating-Point Exceptions

None


Other Exceptions

VPERMI2D/Q/PS/PD: See Exceptions Type E4NF.

VPERMI2W: See Exceptions Type E4NF.nb.


VPERMILPD—Permute In-Lane of Pairs of Double-Precision Floating-Point Values

Opcode/ Instruction

Op / En

64/32 bit Mode Support

CPUID

Feature Flag

Description

VEX.NDS.128.66.0F38.W0 0D /r

VPERMILPD xmm1, xmm2, xmm3/m128

A

V/V

AVX

Permute double-precision floating-point values in xmm2 using controls from xmm3/m128 and store result in xmm1.

VEX.NDS.256.66.0F38.W0 0D /r

VPERMILPD ymm1, ymm2, ymm3/m256

A

V/V

AVX

Permute double-precision floating-point values in ymm2 using controls from ymm3/m256 and store result in ymm1.

EVEX.NDS.128.66.0F38.W1 0D /r

VPERMILPD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst

C

V/V

AVX512VL AVX512F

Permute double-precision floating-point values in xmm2 using control from xmm3/m128/m64bcst and store the result in xmm1 using writemask k1.

EVEX.NDS.256.66.0F38.W1 0D /r

VPERMILPD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst

C

V/V

AVX512VL AVX512F

Permute double-precision floating-point values in ymm2 using control from ymm3/m256/m64bcst and store the result in ymm1 using writemask k1.

EVEX.NDS.512.66.0F38.W1 0D /r

VPERMILPD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst

C

V/V

AVX512F

Permute double-precision floating-point values in zmm2 using control from zmm3/m512/m64bcst and store the result in zmm1 using writemask k1.

VEX.128.66.0F3A.W0 05 /r ib

VPERMILPD xmm1, xmm2/m128, imm8

B

V/V

AVX

Permute double-precision floating-point values in xmm2/m128 using controls from imm8.

VEX.256.66.0F3A.W0 05 /r ib

VPERMILPD ymm1, ymm2/m256, imm8

B

V/V

AVX

Permute double-precision floating-point values in ymm2/m256 using controls from imm8.

EVEX.128.66.0F3A.W1 05 /r ib

VPERMILPD xmm1 {k1}{z}, xmm2/m128/m64bcst, imm8

D

V/V

AVX512VL AVX512F

Permute double-precision floating-point values in xmm2/m128/m64bcst using controls from imm8 and store the result in xmm1 using writemask k1.

EVEX.256.66.0F3A.W1 05 /r ib

VPERMILPD ymm1 {k1}{z}, ymm2/m256/m64bcst, imm8

D

V/V

AVX512VL AVX512F

Permute double-precision floating-point values in ymm2/m256/m64bcst using controls from imm8 and store the result in ymm1 using writemask k1.

EVEX.512.66.0F3A.W1 05 /r ib

VPERMILPD zmm1 {k1}{z}, zmm2/m512/m64bcst, imm8

D

V/V

AVX512F

Permute double-precision floating-point values in zmm2/m512/m64bcst using controls from imm8 and store the result in zmm1 using writemask k1.


Instruction Operand Encoding

Op/En

Tuple Type

Operand 1

Operand 2

Operand 3

Operand 4

A

NA

ModRM:reg (w)

VEX.vvvv (r)

ModRM:r/m (r)

NA

B

NA

ModRM:reg (w)

ModRM:r/m (r)

NA

NA

C

Full

ModRM:reg (w)

EVEX.vvvv (r)

ModRM:r/m (r)

NA

D

Full

ModRM:reg (w)

ModRM:r/m (r)

NA

NA



Description

(variable control version)

Permute pairs of double-precision floating-point values in the first source operand (second operand), each using a 1-bit control field residing in the corresponding quadword element of the second source operand (third operand). Permuted results are stored in the destination operand (first operand).

The control bits are located at bit 1 of each quadword element (see Figure 5-24), consistent with SRC2[1], SRC2[65], etc. in the Operation section below. Each control determines which of the two source elements in an input pair is selected for the destination element. Each pair of source elements must lie in the same 128-bit region as the destination.

EVEX version: The second source operand (third operand) is a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 64-bit memory location. Permuted results are written to the destination under the writemask.
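The variable-control selection can be sketched as a scalar reference model in C for the 128-bit form. This is an illustrative sketch with our own names (vpermilpd128_var_model), not Intel code; note that the select bit is bit 1 of each control quadword, matching SRC2[1] and SRC2[65] in the Operation section.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar sketch of 128-bit VPERMILPD, variable form. Each destination
 * qword picks one of the two source qwords according to bit 1 of the
 * corresponding control qword; all other control bits are ignored. */
static void vpermilpd128_var_model(uint64_t dst[2],
                                   const uint64_t src[2],
                                   const uint64_t ctl[2])
{
    dst[0] = src[(ctl[0] >> 1) & 1];   /* bit 1 of control element 0 */
    dst[1] = src[(ctl[1] >> 1) & 1];   /* bit 1 of control element 1 */
}
```

The bit-1 (rather than bit-0) placement is the detail that most often trips up users of _mm_permutevar_pd.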


[Figure omitted: SRC1 holds X3, X2, X1, X0; each DEST element selects from {X0, X1} in the low 128-bit lane or {X2, X3} in the high lane.]

Figure 5-23. VPERMILPD Operation



VEX.256 encoded version: Bits (MAXVL-1:256) of the corresponding ZMM register are zeroed.



[Figure omitted: in each quadword control field only the select bit is used — bit 1, bit 65, bit 129, bit 193, and so on up to bit 193 of bits 255:0 shown — and all other bits are ignored.]

Figure 5-24. VPERMILPD Shuffle Control



(immediate control version)

Permute pairs of double-precision floating-point values in the first source operand (second operand), each pair using a 1-bit control field in the imm8 byte. Each element in the destination operand (first operand) uses a separate control bit of the imm8 byte.

VEX version: The source operand is a YMM/XMM register or a 256/128-bit memory location and the destination operand is a YMM/XMM register. The imm8 byte provides the lower 4/2 bits as permute control fields.

EVEX version: The source operand (second operand) is a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 64-bit memory location. Permuted results are written to the destination under the writemask. The imm8 byte provides the lower 8/4/2 bits as permute control fields.

Note: For the imm8 versions, VEX.vvvv and EVEX.vvvv are reserved and must be 1111b; otherwise the instruction will #UD.
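The immediate-control selection can be sketched as a scalar reference model in C for the 128-bit form. This is an illustrative sketch with our own names (vpermilpd128_imm_model), not Intel code.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar sketch of 128-bit VPERMILPD, immediate form. imm8 bit 0 picks
 * the source qword for destination element 0, bit 1 for element 1; both
 * picks come from the same 128-bit lane. */
static void vpermilpd128_imm_model(uint64_t dst[2],
                                   const uint64_t src[2],
                                   uint8_t imm8)
{
    dst[0] = src[imm8 & 1];          /* imm8[0]: X0 or X1 */
    dst[1] = src[(imm8 >> 1) & 1];   /* imm8[1]: X0 or X1 */
}
```

For a 256-bit operand the same rule repeats with imm8 bits 2 and 3 applied to the upper 128-bit lane.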



Operation

VPERMILPD (EVEX immediate versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF (EVEX.b = 1) AND (SRC1 *is memory*)
        THEN TMP_SRC1[i+63:i] ← SRC1[63:0];
        ELSE TMP_SRC1[i+63:i] ← SRC1[i+63:i];
    FI;
ENDFOR;
IF (imm8[0] = 0) THEN TMP_DEST[63:0] ← TMP_SRC1[63:0]; FI;
IF (imm8[0] = 1) THEN TMP_DEST[63:0] ← TMP_SRC1[127:64]; FI;
IF (imm8[1] = 0) THEN TMP_DEST[127:64] ← TMP_SRC1[63:0]; FI;
IF (imm8[1] = 1) THEN TMP_DEST[127:64] ← TMP_SRC1[127:64]; FI;
IF VL >= 256
    IF (imm8[2] = 0) THEN TMP_DEST[191:128] ← TMP_SRC1[191:128]; FI;
    IF (imm8[2] = 1) THEN TMP_DEST[191:128] ← TMP_SRC1[255:192]; FI;
    IF (imm8[3] = 0) THEN TMP_DEST[255:192] ← TMP_SRC1[191:128]; FI;
    IF (imm8[3] = 1) THEN TMP_DEST[255:192] ← TMP_SRC1[255:192]; FI;
FI;
IF VL >= 512
    IF (imm8[4] = 0) THEN TMP_DEST[319:256] ← TMP_SRC1[319:256]; FI;
    IF (imm8[4] = 1) THEN TMP_DEST[319:256] ← TMP_SRC1[383:320]; FI;
    IF (imm8[5] = 0) THEN TMP_DEST[383:320] ← TMP_SRC1[319:256]; FI;
    IF (imm8[5] = 1) THEN TMP_DEST[383:320] ← TMP_SRC1[383:320]; FI;
    IF (imm8[6] = 0) THEN TMP_DEST[447:384] ← TMP_SRC1[447:384]; FI;
    IF (imm8[6] = 1) THEN TMP_DEST[447:384] ← TMP_SRC1[511:448]; FI;
    IF (imm8[7] = 0) THEN TMP_DEST[511:448] ← TMP_SRC1[447:384]; FI;
    IF (imm8[7] = 1) THEN TMP_DEST[511:448] ← TMP_SRC1[511:448]; FI;
FI;
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← TMP_DEST[i+63:i]
        ELSE
            IF *merging-masking*            ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE                        ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


VPERMILPD (256-bit immediate version)
IF (imm8[0] = 0) THEN DEST[63:0] ← SRC1[63:0]
IF (imm8[0] = 1) THEN DEST[63:0] ← SRC1[127:64]
IF (imm8[1] = 0) THEN DEST[127:64] ← SRC1[63:0]
IF (imm8[1] = 1) THEN DEST[127:64] ← SRC1[127:64]
IF (imm8[2] = 0) THEN DEST[191:128] ← SRC1[191:128]
IF (imm8[2] = 1) THEN DEST[191:128] ← SRC1[255:192]
IF (imm8[3] = 0) THEN DEST[255:192] ← SRC1[191:128]
IF (imm8[3] = 1) THEN DEST[255:192] ← SRC1[255:192]
DEST[MAXVL-1:256] ← 0



VPERMILPD (128-bit immediate version)
IF (imm8[0] = 0) THEN DEST[63:0] ← SRC1[63:0]
IF (imm8[0] = 1) THEN DEST[63:0] ← SRC1[127:64]
IF (imm8[1] = 0) THEN DEST[127:64] ← SRC1[63:0]
IF (imm8[1] = 1) THEN DEST[127:64] ← SRC1[127:64]
DEST[MAXVL-1:128] ← 0


VPERMILPD (EVEX variable versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF (EVEX.b = 1) AND (SRC2 *is memory*)
        THEN TMP_SRC2[i+63:i] ← SRC2[63:0];
        ELSE TMP_SRC2[i+63:i] ← SRC2[i+63:i];
    FI;
ENDFOR;
IF (TMP_SRC2[1] = 0) THEN TMP_DEST[63:0] ← SRC1[63:0]; FI;
IF (TMP_SRC2[1] = 1) THEN TMP_DEST[63:0] ← SRC1[127:64]; FI;
IF (TMP_SRC2[65] = 0) THEN TMP_DEST[127:64] ← SRC1[63:0]; FI;
IF (TMP_SRC2[65] = 1) THEN TMP_DEST[127:64] ← SRC1[127:64]; FI;
IF VL >= 256
    IF (TMP_SRC2[129] = 0) THEN TMP_DEST[191:128] ← SRC1[191:128]; FI;
    IF (TMP_SRC2[129] = 1) THEN TMP_DEST[191:128] ← SRC1[255:192]; FI;
    IF (TMP_SRC2[193] = 0) THEN TMP_DEST[255:192] ← SRC1[191:128]; FI;
    IF (TMP_SRC2[193] = 1) THEN TMP_DEST[255:192] ← SRC1[255:192]; FI;
FI;
IF VL >= 512
    IF (TMP_SRC2[257] = 0) THEN TMP_DEST[319:256] ← SRC1[319:256]; FI;
    IF (TMP_SRC2[257] = 1) THEN TMP_DEST[319:256] ← SRC1[383:320]; FI;
    IF (TMP_SRC2[321] = 0) THEN TMP_DEST[383:320] ← SRC1[319:256]; FI;
    IF (TMP_SRC2[321] = 1) THEN TMP_DEST[383:320] ← SRC1[383:320]; FI;
    IF (TMP_SRC2[385] = 0) THEN TMP_DEST[447:384] ← SRC1[447:384]; FI;
    IF (TMP_SRC2[385] = 1) THEN TMP_DEST[447:384] ← SRC1[511:448]; FI;
    IF (TMP_SRC2[449] = 0) THEN TMP_DEST[511:448] ← SRC1[447:384]; FI;
    IF (TMP_SRC2[449] = 1) THEN TMP_DEST[511:448] ← SRC1[511:448]; FI;
FI;
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← TMP_DEST[i+63:i]
        ELSE
            IF *merging-masking*            ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE                        ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0



VPERMILPD (256-bit variable version)
IF (SRC2[1] = 0) THEN DEST[63:0] ← SRC1[63:0]
IF (SRC2[1] = 1) THEN DEST[63:0] ← SRC1[127:64]
IF (SRC2[65] = 0) THEN DEST[127:64] ← SRC1[63:0]
IF (SRC2[65] = 1) THEN DEST[127:64] ← SRC1[127:64]
IF (SRC2[129] = 0) THEN DEST[191:128] ← SRC1[191:128]
IF (SRC2[129] = 1) THEN DEST[191:128] ← SRC1[255:192]
IF (SRC2[193] = 0) THEN DEST[255:192] ← SRC1[191:128]
IF (SRC2[193] = 1) THEN DEST[255:192] ← SRC1[255:192]
DEST[MAXVL-1:256] ← 0


VPERMILPD (128-bit variable version)
IF (SRC2[1] = 0) THEN DEST[63:0] ← SRC1[63:0]
IF (SRC2[1] = 1) THEN DEST[63:0] ← SRC1[127:64]
IF (SRC2[65] = 0) THEN DEST[127:64] ← SRC1[63:0]
IF (SRC2[65] = 1) THEN DEST[127:64] ← SRC1[127:64]
DEST[MAXVL-1:128] ← 0


Intel C/C++ Compiler Intrinsic Equivalent

VPERMILPD __m512d _mm512_permute_pd(__m512d a, int imm);
VPERMILPD __m512d _mm512_mask_permute_pd(__m512d s, __mmask8 k, __m512d a, int imm);
VPERMILPD __m512d _mm512_maskz_permute_pd(__mmask8 k, __m512d a, int imm);
VPERMILPD __m256d _mm256_mask_permute_pd(__m256d s, __mmask8 k, __m256d a, int imm);
VPERMILPD __m256d _mm256_maskz_permute_pd(__mmask8 k, __m256d a, int imm);
VPERMILPD __m128d _mm_mask_permute_pd(__m128d s, __mmask8 k, __m128d a, int imm);
VPERMILPD __m128d _mm_maskz_permute_pd(__mmask8 k, __m128d a, int imm);
VPERMILPD __m512d _mm512_permutevar_pd(__m512i i, __m512d a);
VPERMILPD __m512d _mm512_mask_permutevar_pd(__m512d s, __mmask8 k, __m512i i, __m512d a);
VPERMILPD __m512d _mm512_maskz_permutevar_pd(__mmask8 k, __m512i i, __m512d a);
VPERMILPD __m256d _mm256_mask_permutevar_pd(__m256d s, __mmask8 k, __m256i i, __m256d a);
VPERMILPD __m256d _mm256_maskz_permutevar_pd(__mmask8 k, __m256i i, __m256d a);
VPERMILPD __m128d _mm_mask_permutevar_pd(__m128d s, __mmask8 k, __m128i i, __m128d a);
VPERMILPD __m128d _mm_maskz_permutevar_pd(__mmask8 k, __m128i i, __m128d a);
VPERMILPD __m128d _mm_permute_pd(__m128d a, int control);
VPERMILPD __m256d _mm256_permute_pd(__m256d a, int control);
VPERMILPD __m128d _mm_permutevar_pd(__m128d a, __m128i control);
VPERMILPD __m256d _mm256_permutevar_pd(__m256d a, __m256i control);


SIMD Floating-Point Exceptions

None


Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 4; additionally

#UD If VEX.W = 1.

EVEX-encoded instruction, see Exceptions Type E4NF.

#UD If (E)VEX.vvvv != 1111B with the imm8 form.


VPERMILPS—Permute In-Lane of Quadruples of Single-Precision Floating-Point Values

Opcode/ Instruction

Op / En

64/32 bit Mode Support

CPUID

Feature Flag

Description

VEX.NDS.128.66.0F38.W0 0C /r

VPERMILPS xmm1, xmm2, xmm3/m128

A

V/V

AVX

Permute single-precision floating-point values in xmm2 using controls from xmm3/m128 and store result in xmm1.

VEX.128.66.0F3A.W0 04 /r ib

VPERMILPS xmm1, xmm2/m128, imm8

B

V/V

AVX

Permute single-precision floating-point values in xmm2/m128 using controls from imm8 and store result in xmm1.

VEX.NDS.256.66.0F38.W0 0C /r

VPERMILPS ymm1, ymm2, ymm3/m256

A

V/V

AVX

Permute single-precision floating-point values in ymm2 using controls from ymm3/m256 and store result in ymm1.

VEX.256.66.0F3A.W0 04 /r ib

VPERMILPS ymm1, ymm2/m256, imm8

B

V/V

AVX

Permute single-precision floating-point values in ymm2/m256 using controls from imm8 and store result in ymm1.

EVEX.NDS.128.66.0F38.W0 0C /r

VPERMILPS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst

C

V/V

AVX512VL AVX512F

Permute single-precision floating-point values in xmm2 using control from xmm3/m128/m32bcst and store the result in xmm1 using writemask k1.

EVEX.NDS.256.66.0F38.W0 0C /r

VPERMILPS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst

C

V/V

AVX512VL AVX512F

Permute single-precision floating-point values in ymm2 using control from ymm3/m256/m32bcst and store the result in ymm1 using writemask k1.

EVEX.NDS.512.66.0F38.W0 0C /r

VPERMILPS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst

C

V/V

AVX512F

Permute single-precision floating-point values in zmm2 using control from zmm3/m512/m32bcst and store the result in zmm1 using writemask k1.

EVEX.128.66.0F3A.W0 04 /r ib

VPERMILPS xmm1 {k1}{z}, xmm2/m128/m32bcst, imm8

D

V/V

AVX512VL AVX512F

Permute single-precision floating-point values xmm2/m128/m32bcst using controls from imm8 and store the result in xmm1 using writemask k1.

EVEX.256.66.0F3A.W0 04 /r ib

VPERMILPS ymm1 {k1}{z}, ymm2/m256/m32bcst, imm8

D

V/V

AVX512VL AVX512F

Permute single-precision floating-point values ymm2/m256/m32bcst using controls from imm8 and store the result in ymm1 using writemask k1.

EVEX.512.66.0F3A.W0 04 /r ib

VPERMILPS zmm1 {k1}{z}, zmm2/m512/m32bcst, imm8

D

V/V

AVX512F

Permute single-precision floating-point values zmm2/m512/m32bcst using controls from imm8 and store the result in zmm1 using writemask k1.


Instruction Operand Encoding

Op/En

Tuple Type

Operand 1

Operand 2

Operand 3

Operand 4

A

NA

ModRM:reg (w)

VEX.vvvv (r)

ModRM:r/m (r)

NA

B

NA

ModRM:reg (w)

ModRM:r/m (r)

NA

NA

C

Full

ModRM:reg (w)

EVEX.vvvv (r)

ModRM:r/m (r)

NA

D

Full

ModRM:reg (w)

ModRM:r/m (r)

NA

NA



Description

(variable control version)

Permute quadruples of single-precision floating-point values in the first source operand (second operand), each quadruplet using a 2-bit control field in the corresponding dword element of the second source operand. Permuted results are stored in the destination operand (first operand).

The 2-bit control fields are located at the low two bits of each dword element (see Figure 5-26). Each control determines which of the four source elements in an input quadruple is selected for the destination element. Each quadruple of source elements must lie in the same 128-bit region as the destination.

EVEX version: The second source operand (third operand) is a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 32-bit memory location. Permuted results are written to the destination under the writemask.



[Figure omitted: SRC1 holds X7..X0; each DEST element in the low 128-bit lane selects from X3..X0, and each element in the high lane selects from X7..X4.]

Figure 5-25. VPERMILPS Operation




[Figure omitted: in each dword control field only the low two bits are used as the select field (bits 1:0, 33:32, ..., 225:224 of bits 255:0 shown); all other bits are ignored.]

Figure 5-26. VPERMILPS Shuffle Control



(immediate control version)

Permute quadruples of single-precision floating-point values in the first source operand (second operand), each quadruplet using a 2-bit control field in the imm8 byte. Each 128-bit lane in the destination operand (first operand) uses the four control fields of the same imm8 byte.

VEX version: The source operand is a YMM/XMM register or a 256/128-bit memory location and the destination operand is a YMM/XMM register.

EVEX version: The source operand (second operand) is a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 32-bit memory location. Permuted results are written to the destination under the writemask.

Note: For the imm8 version, VEX.vvvv and EVEX.vvvv are reserved and must be 1111b; otherwise the instruction will #UD.
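The immediate-control selection, which the Operation section below expresses with the Select4 helper, can be sketched as a scalar reference model in C for the 128-bit form. This is an illustrative sketch with our own names (vpermilps128_imm_model), not Intel code.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar sketch of 128-bit VPERMILPS, immediate form. Each destination
 * dword i takes the source element chosen by imm8 bits [2i+1:2i], all
 * within the same 128-bit lane (equivalent to Select4(src, imm8[2i+1:2i])). */
static void vpermilps128_imm_model(uint32_t dst[4],
                                   const uint32_t src[4],
                                   uint8_t imm8)
{
    for (int i = 0; i < 4; i++)
        dst[i] = src[(imm8 >> (2 * i)) & 3];
}
```

For wider vectors the same imm8 byte is applied to every 128-bit lane independently.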



Operation

Select4(SRC, control) {
CASE (control[1:0]) OF
    0: TMP ← SRC[31:0];
    1: TMP ← SRC[63:32];
    2: TMP ← SRC[95:64];
    3: TMP ← SRC[127:96];
ESAC;
RETURN TMP
}


VPERMILPS (EVEX immediate versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF (EVEX.b = 1) AND (SRC1 *is memory*)
        THEN TMP_SRC1[i+31:i] ← SRC1[31:0];
        ELSE TMP_SRC1[i+31:i] ← SRC1[i+31:i];
    FI;
ENDFOR;
TMP_DEST[31:0] ← Select4(TMP_SRC1[127:0], imm8[1:0]);
TMP_DEST[63:32] ← Select4(TMP_SRC1[127:0], imm8[3:2]);
TMP_DEST[95:64] ← Select4(TMP_SRC1[127:0], imm8[5:4]);
TMP_DEST[127:96] ← Select4(TMP_SRC1[127:0], imm8[7:6]);
IF VL >= 256
    TMP_DEST[159:128] ← Select4(TMP_SRC1[255:128], imm8[1:0]);
    TMP_DEST[191:160] ← Select4(TMP_SRC1[255:128], imm8[3:2]);
    TMP_DEST[223:192] ← Select4(TMP_SRC1[255:128], imm8[5:4]);
    TMP_DEST[255:224] ← Select4(TMP_SRC1[255:128], imm8[7:6]);
FI;
IF VL >= 512
    TMP_DEST[287:256] ← Select4(TMP_SRC1[383:256], imm8[1:0]);
    TMP_DEST[319:288] ← Select4(TMP_SRC1[383:256], imm8[3:2]);
    TMP_DEST[351:320] ← Select4(TMP_SRC1[383:256], imm8[5:4]);
    TMP_DEST[383:352] ← Select4(TMP_SRC1[383:256], imm8[7:6]);
    TMP_DEST[415:384] ← Select4(TMP_SRC1[511:384], imm8[1:0]);
    TMP_DEST[447:416] ← Select4(TMP_SRC1[511:384], imm8[3:2]);
    TMP_DEST[479:448] ← Select4(TMP_SRC1[511:384], imm8[5:4]);
    TMP_DEST[511:480] ← Select4(TMP_SRC1[511:384], imm8[7:6]);
FI;
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← TMP_DEST[i+31:i]
        ELSE
            IF *merging-masking*
                THEN *DEST[i+31:i] remains unchanged*
                ELSE DEST[i+31:i] ← 0    ; zeroing-masking
            FI;
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0



VPERMILPS (256-bit immediate version)
DEST[31:0] ← Select4(SRC1[127:0], imm8[1:0]);
DEST[63:32] ← Select4(SRC1[127:0], imm8[3:2]);
DEST[95:64] ← Select4(SRC1[127:0], imm8[5:4]);
DEST[127:96] ← Select4(SRC1[127:0], imm8[7:6]);
DEST[159:128] ← Select4(SRC1[255:128], imm8[1:0]);
DEST[191:160] ← Select4(SRC1[255:128], imm8[3:2]);
DEST[223:192] ← Select4(SRC1[255:128], imm8[5:4]);
DEST[255:224] ← Select4(SRC1[255:128], imm8[7:6]);
DEST[MAXVL-1:256] ← 0


VPERMILPS (128-bit immediate version)
DEST[31:0] ← Select4(SRC1[127:0], imm8[1:0]);
DEST[63:32] ← Select4(SRC1[127:0], imm8[3:2]);
DEST[95:64] ← Select4(SRC1[127:0], imm8[5:4]);
DEST[127:96] ← Select4(SRC1[127:0], imm8[7:6]);
DEST[MAXVL-1:128] ← 0


VPERMILPS (EVEX variable versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF (EVEX.b = 1) AND (SRC2 *is memory*)
        THEN TMP_SRC2[i+31:i] ← SRC2[31:0];
        ELSE TMP_SRC2[i+31:i] ← SRC2[i+31:i];
    FI;
ENDFOR;
TMP_DEST[31:0] ← Select4(SRC1[127:0], TMP_SRC2[1:0]);
TMP_DEST[63:32] ← Select4(SRC1[127:0], TMP_SRC2[33:32]);
TMP_DEST[95:64] ← Select4(SRC1[127:0], TMP_SRC2[65:64]);
TMP_DEST[127:96] ← Select4(SRC1[127:0], TMP_SRC2[97:96]);
IF VL >= 256
    TMP_DEST[159:128] ← Select4(SRC1[255:128], TMP_SRC2[129:128]);
    TMP_DEST[191:160] ← Select4(SRC1[255:128], TMP_SRC2[161:160]);
    TMP_DEST[223:192] ← Select4(SRC1[255:128], TMP_SRC2[193:192]);
    TMP_DEST[255:224] ← Select4(SRC1[255:128], TMP_SRC2[225:224]);
FI;
IF VL >= 512
    TMP_DEST[287:256] ← Select4(SRC1[383:256], TMP_SRC2[257:256]);
    TMP_DEST[319:288] ← Select4(SRC1[383:256], TMP_SRC2[289:288]);
    TMP_DEST[351:320] ← Select4(SRC1[383:256], TMP_SRC2[321:320]);
    TMP_DEST[383:352] ← Select4(SRC1[383:256], TMP_SRC2[353:352]);
    TMP_DEST[415:384] ← Select4(SRC1[511:384], TMP_SRC2[385:384]);
    TMP_DEST[447:416] ← Select4(SRC1[511:384], TMP_SRC2[417:416]);
    TMP_DEST[479:448] ← Select4(SRC1[511:384], TMP_SRC2[449:448]);
    TMP_DEST[511:480] ← Select4(SRC1[511:384], TMP_SRC2[481:480]);
FI;
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← TMP_DEST[i+31:i]
        ELSE
            IF *merging-masking*
                THEN *DEST[i+31:i] remains unchanged*
                ELSE DEST[i+31:i] ← 0    ; zeroing-masking
            FI;
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


VPERMILPS (256-bit variable version) DEST[31:0] Select4(SRC1[127:0], SRC2[1:0]); DEST[63:32] Select4(SRC1[127:0], SRC2[33:32]); DEST[95:64] Select4(SRC1[127:0], SRC2[65:64]);

DEST[127:96] Select4(SRC1[127:0], SRC2[97:96]); DEST[159:128] Select4(SRC1[255:128], SRC2[129:128]); DEST[191:160] Select4(SRC1[255:128], SRC2[161:160]); DEST[223:192] Select4(SRC1[255:128], SRC2[193:192]); DEST[255:224] Select4(SRC1[255:128], SRC2[225:224]); DEST[MAXVL-1:256]0


VPERMILPS (128-bit variable version) DEST[31:0] Select4(SRC1[127:0], SRC2[1:0]); DEST[63:32] Select4(SRC1[127:0], SRC2[33:32]); DEST[95:64] Select4(SRC1[127:0], SRC2[65:64]);

DEST[127:96] Select4(SRC1[127:0], SRC2[97:96]); DEST[MAXVL-1:128]0


Intel C/C++ Compiler Intrinsic Equivalent

VPERMILPS m512 _mm512_permute_ps( m512 a, int imm);

VPERMILPS m512 _mm512_mask_permute_ps( m512 s, mmask16 k, m512 a, int imm); VPERMILPS m512 _mm512_maskz_permute_ps( mmask16 k, m512 a, int imm); VPERMILPS m256 _mm256_mask_permute_ps( m256 s, mmask8 k, m256 a, int imm); VPERMILPS m256 _mm256_maskz_permute_ps( mmask8 k, m256 a, int imm); VPERMILPS m128 _mm_mask_permute_ps( m128 s, mmask8 k, m128 a, int imm); VPERMILPS m128 _mm_maskz_permute_ps( mmask8 k, m128 a, int imm);

VPERMILPS m512 _mm512_permutevar_ps( m512i i, m512 a);

VPERMILPS m512 _mm512_mask_permutevar_ps( m512 s, mmask16 k, m512i i, m512 a); VPERMILPS m512 _mm512_maskz_permutevar_ps( mmask16 k, m512i i, m512 a); VPERMILPS m256 _mm256_mask_permutevar_ps( m256 s, mmask8 k, m256 i, m256 a); VPERMILPS m256 _mm256_maskz_permutevar_ps( mmask8 k, m256 i, m256 a); VPERMILPS m128 _mm_mask_permutevar_ps( m128 s, mmask8 k, m128 i, m128 a); VPERMILPS m128 _mm_maskz_permutevar_ps( mmask8 k, m128 i, m128 a);

VPERMILPS m128 _mm_permute_ps ( m128 a, int control); VPERMILPS m256 _mm256_permute_ps ( m256 a, int control); VPERMILPS m128 _mm_permutevar_ps ( m128 a, m128i control);

VPERMILPS m256 _mm256_permutevar_ps ( m256 a, m256i control);


SIMD Floating-Point Exceptions

None


Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 4;

#UD If VEX.W = 1.

EVEX-encoded instruction, see Exceptions Type E4NF.

#UD If either (E)VEX.vvvv != 1111B and with imm8.


VPERMPD—Permute Double-Precision Floating-Point Elements

Opcode/ Instruction

Op / En

64/32

bit Mode Support

CPUID

Feature Flag

Description

VEX.256.66.0F3A.W1 01 /r ib

VPERMPD ymm1, ymm2/m256, imm8

A

V/V

AVX2

Permute double-precision floating-point elements in ymm2/m256 using indices in imm8 and store the result in ymm1.

EVEX.256.66.0F3A.W1 01 /r ib VPERMPD ymm1 {k1}{z},

ymm2/m256/m64bcst, imm8

B

V/V

AVX512VL AVX512F

Permute double-precision floating-point elements in ymm2/m256/m64bcst using indexes in imm8 and store the result in ymm1 subject to writemask k1.

EVEX.512.66.0F3A.W1 01 /r ib VPERMPD zmm1 {k1}{z},

zmm2/m512/m64bcst, imm8

B

V/V

AVX512F

Permute double-precision floating-point elements in zmm2/m512/m64bcst using indices in imm8 and store the result in zmm1 subject to writemask k1.

EVEX.NDS.256.66.0F38.W1 16 /r

VPERMPD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst

C

V/V

AVX512VL AVX512F

Permute double-precision floating-point elements in ymm3/m256/m64bcst using indexes in ymm2 and store the result in ymm1 subject to writemask k1.

EVEX.NDS.512.66.0F38.W1 16 /r

VPERMPD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst

C

V/V

AVX512F

Permute double-precision floating-point elements in zmm3/m512/m64bcst using indices in zmm2 and store the result in zmm1 subject to writemask k1.


Instruction Operand Encoding

Op/En

Tuple Type

Operand 1

Operand 2

Operand 3

Operand 4

A

NA

ModRM:reg (w)

ModRM:r/m (r)

Imm8

NA

B

Full

ModRM:reg (w)

ModRM:r/m (r)

Imm8

NA

C

Full

ModRM:reg (w)

EVEX.vvvv (r)

ModRM:r/m (r)

NA

Description

The imm8 version: Copies quadword elements of double-precision floating-point values from the source operand (the second operand) to the destination operand (the first operand) according to the indices specified by the imme- diate operand (the third operand). Each two-bit value in the immediate byte selects a qword element in the source operand.

VEX version: The source operand can be a YMM register or a memory location. Bits (MAXVL-1:256) of the corre- sponding destination register are zeroed.

In EVEX.512 encoded version, The elements in the destination are updated using the writemask k1 and the imm8 bits are reused as control bits for the upper 256-bit half when the control bits are coming from immediate. The source operand can be a ZMM register, a 512-bit memory location or a 512-bit vector broadcasted from a 64-bit memory location.

The imm8 versions: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD.

The vector control version: Copies quadword elements of double-precision floating-point values from the second source operand (the third operand) to the destination operand (the first operand) according to the indices in the first source operand (the second operand). The first 3 bits of each 64 bit element in the index operand selects which quadword in the second source operand to copy. The first and second operands are ZMM registers, the third operand can be a ZMM register, a 512-bit memory location or a 512-bit vector broadcasted from a 64-bit memory location. The elements in the destination are updated using the writemask k1.

Note that this instruction permits a qword in the source operand to be copied to multiple locations in the destination operand.

If VPERMPD is encoded with VEX.L= 0, an attempt to execute the instruction encoded with VEX.L= 0 will cause an

#UD exception.



Operation

VPERMPD (EVEX - imm8 control forms)

(KL, VL) = (4, 256), (8, 512)

FOR j 0 TO KL-1

i j * 64

IF (EVEX.b = 1) AND (SRC *is memory*) THEN TMP_SRC[i+63:i] SRC[63:0]; ELSE TMP_SRC[i+63:i] SRC[i+63:i];

FI;

ENDFOR;


TMP_DEST[63:0] (TMP_SRC[256:0] >> (IMM8[1:0] * 64))[63:0];

TMP_DEST[127:64] (TMP_SRC[256:0] >> (IMM8[3:2] * 64))[63:0]; TMP_DEST[191:128] (TMP_SRC[256:0] >> (IMM8[5:4] * 64))[63:0]; TMP_DEST[255:192] (TMP_SRC[256:0] >> (IMM8[7:6] * 64))[63:0]; IF VL >= 512

TMP_DEST[319:256] (TMP_SRC[511:256] >> (IMM8[1:0] * 64))[63:0]; TMP_DEST[383:320] (TMP_SRC[511:256] >> (IMM8[3:2] * 64))[63:0]; TMP_DEST[447:384] (TMP_SRC[511:256] >> (IMM8[5:4] * 64))[63:0]; TMP_DEST[511:448] (TMP_SRC[511:256] >> (IMM8[7:6] * 64))[63:0];

FI;

FOR j 0 TO KL-1

i j * 64

IF k1[j] OR *no writemask*

THEN DEST[i+63:i] TMP_DEST[i+63:i] ELSE

IF *merging-masking* ; merging-masking THEN *DEST[i+63:i] remains unchanged*

ELSE ; zeroing-masking

DEST[i+63:i] 0 ;zeroing-masking

FI;

FI;

ENDFOR

DEST[MAXVL-1:VL] 0


VPERMPD (EVEX - vector control forms)

(KL, VL) = (4, 256), (8, 512)

FOR j 0 TO KL-1

i j * 64

IF (EVEX.b = 1) AND (SRC2 *is memory*) THEN TMP_SRC2[i+63:i] SRC2[63:0]; ELSE TMP_SRC2[i+63:i] SRC2[i+63:i];

FI;

ENDFOR;


IF VL = 256

TMP_DEST[63:0] (TMP_SRC2[255:0] >> (SRC1[1:0] * 64))[63:0]; TMP_DEST[127:64] (TMP_SRC2[255:0] >> (SRC1[65:64] * 64))[63:0]; TMP_DEST[191:128] (TMP_SRC2[255:0] >> (SRC1[129:128] * 64))[63:0]; TMP_DEST[255:192] (TMP_SRC2[255:0] >> (SRC1[193:192] * 64))[63:0];

FI;

IF VL = 512

TMP_DEST[63:0] (TMP_SRC2[511:0] >> (SRC1[2:0] * 64))[63:0];



FI;


TMP_DEST[127:64] (TMP_SRC2[511:0] >> (SRC1[66:64] * 64))[63:0]; TMP_DEST[191:128] (TMP_SRC2[511:0] >> (SRC1[130:128] * 64))[63:0]; TMP_DEST[255:192] (TMP_SRC2[511:0] >> (SRC1[194:192] * 64))[63:0]; TMP_DEST[319:256] (TMP_SRC2[511:0] >> (SRC1[258:256] * 64))[63:0]; TMP_DEST[383:320] (TMP_SRC2[511:0] >> (SRC1[322:320] * 64))[63:0]; TMP_DEST[447:384] (TMP_SRC2[511:0] >> (SRC1[386:384] * 64))[63:0]; TMP_DEST[511:448] (TMP_SRC2[511:0] >> (SRC1[450:448] * 64))[63:0];

FOR j 0 TO KL-1

i j * 64

IF k1[j] OR *no writemask*

THEN DEST[i+63:i] TMP_DEST[i+63:i] ELSE

IF *merging-masking* ; merging-masking THEN *DEST[i+63:i] remains unchanged*

ELSE ; zeroing-masking

DEST[i+63:i] 0 ;zeroing-masking

FI;

FI;

ENDFOR

DEST[MAXVL-1:VL] 0


VPERMPD (VEX.256 encoded version)

DEST[63:0] (SRC[255:0] >> (IMM8[1:0] * 64))[63:0];

DEST[127:64] (SRC[255:0] >> (IMM8[3:2] * 64))[63:0];

DEST[191:128] (SRC[255:0] >> (IMM8[5:4] * 64))[63:0];

DEST[255:192] (SRC[255:0] >> (IMM8[7:6] * 64))[63:0]; DEST[MAXVL-1:256] 0


Intel C/C++ Compiler Intrinsic Equivalent

VPERMPD m512d _mm512_permutex_pd( m512d a, int imm);

VPERMPD m512d _mm512_mask_permutex_pd( m512d s, mmask16 k, m512d a, int imm); VPERMPD m512d _mm512_maskz_permutex_pd( mmask16 k, m512d a, int imm); VPERMPD m512d _mm512_permutexvar_pd( m512i i, m512d a);

VPERMPD m512d _mm512_mask_permutexvar_pd( m512d s, mmask16 k, m512i i, m512d a); VPERMPD m512d _mm512_maskz_permutexvar_pd( mmask16 k, m512i i, m512d a); VPERMPD m256d _mm256_permutex_epi64( m256d a, int imm);

VPERMPD m256d _mm256_mask_permutex_epi64( m256i s, mmask8 k, m256d a, int imm); VPERMPD m256d _mm256_maskz_permutex_epi64( mmask8 k, m256d a, int imm); VPERMPD m256d _mm256_permutexvar_epi64( m256i i, m256d a);

VPERMPD m256d _mm256_mask_permutexvar_epi64( m256i s, mmask8 k, m256i i, m256d a); VPERMPD m256d _mm256_maskz_permutexvar_epi64( mmask8 k, m256i i, m256d a);


SIMD Floating-Point Exceptions

None


Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 4; additionally

#UD If VEX.L = 0.

If VEX.vvvv != 1111B.

EVEX-encoded instruction, see Exceptions Type E4NF.

#UD If encoded with EVEX.128.

If EVEX.vvvv != 1111B and with imm8.


VPERMPS—Permute Single-Precision Floating-Point Elements

Opcode/ Instruction

Op / En

64/32

bit Mode Support

CPUID

Feature Flag

Description

VEX.256.66.0F38.W0 16 /r

VPERMPS ymm1, ymm2, ymm3/m256

A

V/V

AVX2

Permute single-precision floating-point elements in ymm3/m256 using indices in ymm2 and store the result in ymm1.

EVEX.NDS.256.66.0F38.W0 16 /r

VPERMPS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst

B

V/V

AVX512VL AVX512F

Permute single-precision floating-point elements in ymm3/m256/m32bcst using indexes in ymm2 and store the result in ymm1 subject to write mask k1.

EVEX.NDS.512.66.0F38.W0 16 /r

VPERMPS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst

B

V/V

AVX512F

Permute single-precision floating-point values in zmm3/m512/m32bcst using indices in zmm2 and store the result in zmm1 subject to write mask k1.


Instruction Operand Encoding

Op/En

Tuple Type

Operand 1

Operand 2

Operand 3

Operand 4

A

NA

ModRM:reg (w)

VEX.vvvv (r)

ModRM:r/m (r)

NA

B

Full

ModRM:reg (w)

EVEX.vvvv (r)

ModRM:r/m (r)

NA

Description

Copies doubleword elements of single-precision floating-point values from the second source operand (the third operand) to the destination operand (the first operand) according to the indices in the first source operand (the second operand). Note that this instruction permits a doubleword in the source operand to be copied to more than one location in the destination operand.

VEX.256 versions: The first and second operands are YMM registers, the third operand can be a YMM register or memory location. Bits (MAXVL-1:256) of the corresponding destination register are zeroed.

EVEX encoded version: The first and second operands are ZMM registers, the third operand can be a ZMM register, a 512-bit memory location or a 512-bit vector broadcasted from a 32-bit memory location. The elements in the destination are updated using the writemask k1.

If VPERMPS is encoded with VEX.L= 0, an attempt to execute the instruction encoded with VEX.L= 0 will cause an

#UD exception.


Operation

VPERMPS (EVEX forms)

(KL, VL) (8, 256),= (16, 512)

FOR j 0 TO KL-1

i j * 64

IF (EVEX.b = 1) AND (SRC2 *is memory*) THEN TMP_SRC2[i+31:i] SRC2[31:0]; ELSE TMP_SRC2[i+31:i] SRC2[i+31:i];

FI;

ENDFOR;


IF VL = 256

TMP_DEST[31:0] (TMP_SRC2[255:0] >> (SRC1[2:0] * 32))[31:0]; TMP_DEST[63:32] (TMP_SRC2[255:0] >> (SRC1[34:32] * 32))[31:0]; TMP_DEST[95:64] (TMP_SRC2[255:0] >> (SRC1[66:64] * 32))[31:0]; TMP_DEST[127:96] (TMP_SRC2[255:0] >> (SRC1[98:96] * 32))[31:0]; TMP_DEST[159:128] (TMP_SRC2[255:0] >> (SRC1[130:128] * 32))[31:0]; TMP_DEST[191:160] (TMP_SRC2[255:0] >> (SRC1[162:160] * 32))[31:0]; TMP_DEST[223:192] (TMP_SRC2[255:0] >> (SRC1[193:192] * 32))[31:0]; TMP_DEST[255:224] (TMP_SRC2[255:0] >> (SRC1[226:224] * 32))[31:0];



FI;

IF VL = 512

TMP_DEST[31:0] (TMP_SRC2[511:0] >> (SRC1[3:0] * 32))[31:0]; TMP_DEST[63:32] (TMP_SRC2[511:0] >> (SRC1[35:32] * 32))[31:0]; TMP_DEST[95:64] (TMP_SRC2[511:0] >> (SRC1[67:64] * 32))[31:0]; TMP_DEST[127:96] (TMP_SRC2[511:0] >> (SRC1[99:96] * 32))[31:0]; TMP_DEST[159:128] (TMP_SRC2[511:0] >> (SRC1[131:128] * 32))[31:0]; TMP_DEST[191:160] (TMP_SRC2[511:0] >> (SRC1[163:160] * 32))[31:0]; TMP_DEST[223:192] (TMP_SRC2[511:0] >> (SRC1[195:192] * 32))[31:0]; TMP_DEST[255:224] (TMP_SRC2[511:0] >> (SRC1[227:224] * 32))[31:0]; TMP_DEST[287:256] (TMP_SRC2[511:0] >> (SRC1[259:256] * 32))[31:0]; TMP_DEST[319:288] (TMP_SRC2[511:0] >> (SRC1[291:288] * 32))[31:0]; TMP_DEST[351:320] (TMP_SRC2[511:0] >> (SRC1[323:320] * 32))[31:0]; TMP_DEST[383:352] (TMP_SRC2[511:0] >> (SRC1[355:352] * 32))[31:0]; TMP_DEST[415:384] (TMP_SRC2[511:0] >> (SRC1[387:384] * 32))[31:0]; TMP_DEST[447:416] (TMP_SRC2[511:0] >> (SRC1[419:416] * 32))[31:0]; TMP_DEST[479:448] (TMP_SRC2[511:0] >> (SRC1[451:448] * 32))[31:0]; TMP_DEST[511:480] (TMP_SRC2[511:0] >> (SRC1[483:480] * 32))[31:0];

FI;

FOR j 0 TO KL-1

i j * 32

IF k1[j] OR *no writemask*

THEN DEST[i+31:i] TMP_DEST[i+31:i] ELSE

IF *merging-masking* ; merging-masking THEN *DEST[i+31:i] remains unchanged*

ELSE ; zeroing-masking

DEST[i+31:i] 0 ;zeroing-masking

FI;

FI;

ENDFOR

DEST[MAXVL-1:VL] 0


VPERMPS (VEX.256 encoded version)

DEST[31:0] (SRC2[255:0] >> (SRC1[2:0] * 32))[31:0];

DEST[63:32] (SRC2[255:0] >> (SRC1[34:32] * 32))[31:0];

DEST[95:64] (SRC2[255:0] >> (SRC1[66:64] * 32))[31:0];

DEST[127:96] (SRC2[255:0] >> (SRC1[98:96] * 32))[31:0];

DEST[159:128] (SRC2[255:0] >> (SRC1[130:128] * 32))[31:0];

DEST[191:160] (SRC2[255:0] >> (SRC1[162:160] * 32))[31:0];

DEST[223:192] (SRC2[255:0] >> (SRC1[194:192] * 32))[31:0];

DEST[255:224] (SRC2[255:0] >> (SRC1[226:224] * 32))[31:0]; DEST[MAXVL-1:256] 0



Intel C/C++ Compiler Intrinsic Equivalent

VPERMPS m512 _mm512_permutexvar_ps( m512i i, m512 a);

VPERMPS m512 _mm512_mask_permutexvar_ps( m512 s, mmask16 k, m512i i, m512 a); VPERMPS m512 _mm512_maskz_permutexvar_ps( mmask16 k, m512i i, m512 a); VPERMPS m256 _mm256_permutexvar_ps( m256 i, m256 a);

VPERMPS m256 _mm256_mask_permutexvar_ps( m256 s, mmask8 k, m256 i, m256 a); VPERMPS m256 _mm256_maskz_permutexvar_ps( mmask8 k, m256 i, m256 a);


SIMD Floating-Point Exceptions

None


Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 4; additionally

#UD If VEX.L = 0.

EVEX-encoded instruction, see Exceptions Type E4NF.


VPERMQ—Qwords Element Permutation

Opcode/ Instruction

Op / En

64/32

bit Mode Support

CPUID

Feature Flag

Description

VEX.256.66.0F3A.W1 00 /r ib

VPERMQ ymm1, ymm2/m256, imm8

A

V/V

AVX2

Permute qwords in ymm2/m256 using indices in imm8 and store the result in ymm1.

EVEX.256.66.0F3A.W1 00 /r ib VPERMQ ymm1 {k1}{z},

ymm2/m256/m64bcst, imm8

B

V/V

AVX512VL AVX512F

Permute qwords in ymm2/m256/m64bcst using indexes in imm8 and store the result in ymm1.

EVEX.512.66.0F3A.W1 00 /r ib VPERMQ zmm1 {k1}{z},

zmm2/m512/m64bcst, imm8

B

V/V

AVX512F

Permute qwords in zmm2/m512/m64bcst using indices in imm8 and store the result in zmm1.

EVEX.NDS.256.66.0F38.W1 36 /r

VPERMQ ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst

C

V/V

AVX512VL AVX512F

Permute qwords in ymm3/m256/m64bcst using indexes in ymm2 and store the result in ymm1.

EVEX.NDS.512.66.0F38.W1 36 /r

VPERMQ zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst

C

V/V

AVX512F

Permute qwords in zmm3/m512/m64bcst using indices in zmm2 and store the result in zmm1.


Instruction Operand Encoding

Op/En

Tuple Type

Operand 1

Operand 2

Operand 3

Operand 4

A

NA

ModRM:reg (w)

ModRM:r/m (r)

Imm8

NA

B

Full

ModRM:reg (w)

ModRM:r/m (r)

Imm8

NA

C

Full

ModRM:reg (w)

EVEX.vvvv (r)

ModRM:r/m (r)

NA

Description

The imm8 version: Copies quadwords from the source operand (the second operand) to the destination operand (the first operand) according to the indices specified by the immediate operand (the third operand). Each two-bit value in the immediate byte selects a qword element in the source operand.

VEX version: The source operand can be a YMM register or a memory location. Bits (MAXVL-1:256) of the corre- sponding destination register are zeroed.

In EVEX.512 encoded version, The elements in the destination are updated using the writemask k1 and the imm8 bits are reused as control bits for the upper 256-bit half when the control bits are coming from immediate. The source operand can be a ZMM register, a 512-bit memory location or a 512-bit vector broadcasted from a 64-bit memory location.

Immediate control versions: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will

#UD.

The vector control version: Copies quadwords from the second source operand (the third operand) to the destina- tion operand (the first operand) according to the indices in the first source operand (the second operand). The first 3 bits of each 64 bit element in the index operand selects which quadword in the second source operand to copy. The first and second operands are ZMM registers, the third operand can be a ZMM register, a 512-bit memory loca- tion or a 512-bit vector broadcasted from a 64-bit memory location. The elements in the destination are updated using the writemask k1.

Note that this instruction permits a qword in the source operand to be copied to multiple locations in the destination operand.

If VPERMPQ is encoded with VEX.L= 0 or EVEX.128, an attempt to execute the instruction will cause an #UD excep- tion.



Operation

VPERMQ (EVEX - imm8 control forms)

(KL, VL) = (4, 256), (8, 512)

FOR j 0 TO KL-1

i j * 64

IF (EVEX.b = 1) AND (SRC *is memory*) THEN TMP_SRC[i+63:i] SRC[63:0]; ELSE TMP_SRC[i+63:i] SRC[i+63:i];

FI;

ENDFOR;

TMP_DEST[63:0] (TMP_SRC[255:0] >> (IMM8[1:0] * 64))[63:0];

TMP_DEST[127:64] (TMP_SRC[255:0] >> (IMM8[3:2] * 64))[63:0]; TMP_DEST[191:128] (TMP_SRC[255:0] >> (IMM8[5:4] * 64))[63:0]; TMP_DEST[255:192] (TMP_SRC[255:0] >> (IMM8[7:6] * 64))[63:0];

IF VL >= 512

TMP_DEST[319:256] (TMP_SRC[511:256] >> (IMM8[1:0] * 64))[63:0]; TMP_DEST[383:320] (TMP_SRC[511:256] >> (IMM8[3:2] * 64))[63:0]; TMP_DEST[447:384] (TMP_SRC[511:256] >> (IMM8[5:4] * 64))[63:0]; TMP_DEST[511:448] (TMP_SRC[511:256] >> (IMM8[7:6] * 64))[63:0];

FI;

FOR j 0 TO KL-1

i j * 64

IF k1[j] OR *no writemask*

THEN DEST[i+63:i] TMP_DEST[i+63:i] ELSE

IF *merging-masking* ; merging-masking THEN *DEST[i+63:i] remains unchanged*

ELSE ; zeroing-masking

DEST[i+63:i] 0 ;zeroing-masking

FI;

FI;

ENDFOR

DEST[MAXVL-1:VL] 0


VPERMQ (EVEX - vector control forms)

(KL, VL) = (4, 256), (8, 512)

FOR j 0 TO KL-1

i j * 64

IF (EVEX.b = 1) AND (SRC2 *is memory*) THEN TMP_SRC2[i+63:i] SRC2[63:0]; ELSE TMP_SRC2[i+63:i] SRC2[i+63:i];

FI;

ENDFOR;

IF VL = 256

TMP_DEST[63:0] (TMP_SRC2[255:0] >> (SRC1[1:0] * 64))[63:0]; TMP_DEST[127:64] (TMP_SRC2[255:0] >> (SRC1[65:64] * 64))[63:0]; TMP_DEST[191:128] (TMP_SRC2[255:0] >> (SRC1[129:128] * 64))[63:0]; TMP_DEST[255:192] (TMP_SRC2[255:0] >> (SRC1[193:192] * 64))[63:0];

FI;

IF VL = 512

TMP_DEST[63:0] (TMP_SRC2[511:0] >> (SRC1[2:0] * 64))[63:0]; TMP_DEST[127:64] (TMP_SRC2[511:0] >> (SRC1[66:64] * 64))[63:0]; TMP_DEST[191:128] (TMP_SRC2[511:0] >> (SRC1[130:128] * 64))[63:0]; TMP_DEST[255:192] (TMP_SRC2[511:0] >> (SRC1[194:192] * 64))[63:0];


5-368 Vol. 2C VPERMQ—Qwords Element Permutation



FI;


TMP_DEST[319:256] (TMP_SRC2[511:0] >> (SRC1[258:256] * 64))[63:0]; TMP_DEST[383:320] (TMP_SRC2[511:0] >> (SRC1[322:320] * 64))[63:0]; TMP_DEST[447:384] (TMP_SRC2[511:0] >> (SRC1[386:384] * 64))[63:0]; TMP_DEST[511:448] (TMP_SRC2[511:0] >> (SRC1[450:448] * 64))[63:0];

FOR j 0 TO KL-1

i j * 64

IF k1[j] OR *no writemask*

THEN DEST[i+63:i] TMP_DEST[i+63:i] ELSE

IF *merging-masking* ; merging-masking THEN *DEST[i+63:i] remains unchanged*

ELSE ; zeroing-masking

DEST[i+63:i] 0 ;zeroing-masking

FI;

FI;

ENDFOR

DEST[MAXVL-1:VL] 0


VPERMQ (VEX.256 encoded version)

DEST[63:0] (SRC[255:0] >> (IMM8[1:0] * 64))[63:0];

DEST[127:64] (SRC[255:0] >> (IMM8[3:2] * 64))[63:0];

DEST[191:128] (SRC[255:0] >> (IMM8[5:4] * 64))[63:0];

DEST[255:192] (SRC[255:0] >> (IMM8[7:6] * 64))[63:0]; DEST[MAXVL-1:256] 0


Intel C/C++ Compiler Intrinsic Equivalent

VPERMQ m512i _mm512_permutex_epi64( m512i a, int imm);

VPERMQ m512i _mm512_mask_permutex_epi64( m512i s, mmask8 k, m512i a, int imm); VPERMQ m512i _mm512_maskz_permutex_epi64( mmask8 k, m512i a, int imm);

VPERMQ m512i _mm512_permutexvar_epi64( m512i a, m512i b);

VPERMQ m512i _mm512_mask_permutexvar_epi64( m512i s, mmask8 k, m512i a, m512i b); VPERMQ m512i _mm512_maskz_permutexvar_epi64( mmask8 k, m512i a, m512i b);

VPERMQ m256i _mm256_permutex_epi64( m256i a, int imm);

VPERMQ m256i _mm256_mask_permutex_epi64( m256i s, mmask8 k, m256i a, int imm); VPERMQ m256i _mm256_maskz_permutex_epi64( mmask8 k, m256i a, int imm);

VPERMQ m256i _mm256_permutexvar_epi64( m256i a, m256i b);

VPERMQ m256i _mm256_mask_permutexvar_epi64( m256i s, mmask8 k, m256i a, m256i b); VPERMQ m256i _mm256_maskz_permutexvar_epi64( mmask8 k, m256i a, m256i b);


SIMD Floating-Point Exceptions

None


Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 4; additionally

#UD If VEX.L = 0.

If VEX.vvvv != 1111B.

EVEX-encoded instruction, see Exceptions Type E4NF.

#UD If encoded with EVEX.128.

If EVEX.vvvv != 1111B and with imm8.


VPERMT2B—Full Permute of Bytes from Two Tables Overwriting a Table

Opcode/ Instruction

Op

/ En

64/32

bit Mode Support

CPUID Feature Flag

Description

EVEX.DDS.128.66.0F38.W0 7D /r

VPERMT2B xmm1 {k1}{z}, xmm2, xmm3/m128

A

V/V

AVX512VL AVX512_VBMI

Permute bytes in xmm3/m128 and xmm1 using byte indexes in xmm2 and store the byte results in xmm1 using writemask k1.

EVEX.NDS.256.66.0F38.W0 7D /r

VPERMT2B ymm1 {k1}{z}, ymm2, ymm3/m256

A

V/V

AVX512VL AVX512_VBMI

Permute bytes in ymm3/m256 and ymm1 using byte indexes in ymm2 and store the byte results in ymm1 using writemask k1.

EVEX.NDS.512.66.0F38.W0 7D /r

VPERMT2B zmm1 {k1}{z}, zmm2, zmm3/m512

A

V/V

AVX512_VBMI

Permute bytes in zmm3/m512 and zmm1 using byte indexes in zmm2 and store the byte results in zmm1 using writemask k1.


Instruction Operand Encoding

Op/En

Tuple Type

Operand 1

Operand 2

Operand 3

Operand 4

A

Full Mem

ModRM:reg (r, w)

EVEX.vvvv (r)

ModRM:r/m (r)

NA

Description

Permutes byte values from two tables, comprising of the first operand (also the destination operand) and the third operand (the second source operand). The second operand (the first source operand) provides byte indices to select byte results from the two tables. The selected byte elements are written to the destination at byte granu- larity under the writemask k1.

The first and second operands are ZMM/YMM/XMM registers. The second operand contains input indices to select elements from the two input tables in the 1st and 3rd operands. The first operand is also the destination of the result. The second source operand can be a ZMM/YMM/XMM register, or a 512/256/128-bit memory location. In each index byte, the id bit for table selection is bit 6/5/4, and bits [5:0]/[4:0]/[3:0] selects element within each input table.

Note that these instructions permit a byte value in the source operands to be copied to more than one location in the destination operand. Also, the second table and the indices can be reused in subsequent iterations, but the first table is overwritten.

Bits (MAX_VL-1:256/128) of the destination are zeroed for VL=256,128.



Operation

VPERMT2B (EVEX encoded versions) (KL, VL) = (16, 128), (32, 256), (64, 512) IF VL = 128:

id 3;

ELSE IF VL = 256:

id 4;

ELSE IF VL = 512:

id 5;

FI;

TMP_DEST[VL-1:0] DEST[VL-1:0];

FOR j 0 TO KL-1

off 8*SRC1[j*8 + id: j*8] ; IF k1[j] OR *no writemask*:

DEST[j*8 + 7: j*8] SRC1[j*8+id+1]? SRC2[off+7:off] : TMP_DEST[off+7:off]; ELSE IF *zeroing-masking*

DEST[j*8 + 7: j*8] 0;

*ELSE

DEST[j*8 + 7: j*8] remains unchanged*

FI; ENDFOR

DEST[MAX_VL-1:VL] 0;


Intel C/C++ Compiler Intrinsic Equivalent

VPERMT2B m512i _mm512_permutex2var_epi8( m512i a, m512i idx, m512i b);

VPERMT2B m512i _mm512_mask_permutex2var_epi8( m512i a, mmask64 k, m512i idx, m512i b); VPERMT2B m512i _mm512_maskz_permutex2var_epi8( mmask64 k, m512i a, m512i idx, m512i b); VPERMT2B m256i _mm256_permutex2var_epi8( m256i a, m256i idx, m256i b);

VPERMT2B m256i _mm256_mask_permutex2var_epi8( m256i a, mmask32 k, m256i idx, m256i b); VPERMT2B m256i _mm256_maskz_permutex2var_epi8( mmask32 k, m256i a, m256i idx, m256i b); VPERMT2B m128i _mm_permutex2var_epi8( m128i a, m128i idx, m128i b);

VPERMT2B m128i _mm_mask_permutex2var_epi8( m128i a, mmask16 k, m128i idx, m128i b); VPERMT2B m128i _mm_maskz_permutex2var_epi8( mmask16 k, m128i a, m128i idx, m128i b);


SIMD Floating-Point Exceptions

None.


Other Exceptions

See Exceptions Type E4NF.nb.


VPERMT2W/D/Q/PS/PD—Full Permute from Two Tables Overwriting one Table

Opcode/ Instruction

Op / En

64/32

bit Mode Support

CPUID

Feature Flag

Description

EVEX.DDS.128.66.0F38.W1 7D /r

VPERMT2W xmm1 {k1}{z}, xmm2, xmm3/m128

A

V/V

AVX512VL AVX512BW

Permute word integers from two tables in xmm3/m128 and xmm1 using indexes in xmm2 and store the result in xmm1 using writemask k1.

EVEX.DDS.256.66.0F38.W1 7D /r

VPERMT2W ymm1 {k1}{z}, ymm2, ymm3/m256

A

V/V

AVX512VL AVX512BW

Permute word integers from two tables in ymm3/m256 and ymm1 using indexes in ymm2 and store the result in ymm1 using writemask k1.

EVEX.DDS.512.66.0F38.W1 7D /r

VPERMT2W zmm1 {k1}{z}, zmm2, zmm3/m512

A

V/V

AVX512BW

Permute word integers from two tables in zmm3/m512 and zmm1 using indexes in zmm2 and store the result in zmm1 using writemask k1.

EVEX.DDS.128.66.0F38.W0 7E /r

VPERMT2D xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst

B

V/V

AVX512VL AVX512F

Permute double-words from two tables in xmm3/m128/m32bcst and xmm1 using indexes in xmm2 and store the result in xmm1 using writemask k1.

EVEX.DDS.256.66.0F38.W0 7E /r

VPERMT2D ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst

B

V/V

AVX512VL AVX512F

Permute double-words from two tables in ymm3/m256/m32bcst and ymm1 using indexes in ymm2 and store the result in ymm1 using writemask k1.

EVEX.DDS.512.66.0F38.W0 7E /r

VPERMT2D zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst

B

V/V

AVX512F

Permute double-words from two tables in zmm3/m512/m32bcst and zmm1 using indices in zmm2 and store the result in zmm1 using writemask k1.

EVEX.DDS.128.66.0F38.W1 7E /r

VPERMT2Q xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst

B

V/V

AVX512VL AVX512F

Permute quad-words from two tables in xmm3/m128/m64bcst and xmm1 using indexes in xmm2 and store the result in xmm1 using writemask k1.

EVEX.DDS.256.66.0F38.W1 7E /r

VPERMT2Q ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst

B

V/V

AVX512VL AVX512F

Permute quad-words from two tables in ymm3/m256/m64bcst and ymm1 using indexes in ymm2 and store the result in ymm1 using writemask k1.

EVEX.DDS.512.66.0F38.W1 7E /r

VPERMT2Q zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst

B

V/V

AVX512F

Permute quad-words from two tables in zmm3/m512/m64bcst and zmm1 using indices in zmm2 and store the result in zmm1 using writemask k1.

EVEX.DDS.128.66.0F38.W0 7F /r VPERMT2PS xmm1 {k1}{z},

xmm2, xmm3/m128/m32bcst

B

V/V

AVX512VL AVX512F

Permute single-precision FP values from two tables in xmm3/m128/m32bcst and xmm1 using indexes in xmm2 and store the result in xmm1 using writemask k1.

EVEX.DDS.256.66.0F38.W0 7F /r VPERMT2PS ymm1 {k1}{z},

ymm2, ymm3/m256/m32bcst

B

V/V

AVX512VL AVX512F

Permute single-precision FP values from two tables in ymm3/m256/m32bcst and ymm1 using indexes in ymm2 and store the result in ymm1 using writemask k1.

EVEX.DDS.512.66.0F38.W0 7F /r VPERMT2PS zmm1 {k1}{z},

zmm2, zmm3/m512/m32bcst

B

V/V

AVX512F

Permute single-precision FP values from two tables in zmm3/m512/m32bcst and zmm1 using indices in zmm2 and store the result in zmm1 using writemask k1.

EVEX.DDS.128.66.0F38.W1 7F /r VPERMT2PD xmm1 {k1}{z},

xmm2, xmm3/m128/m64bcst

B

V/V

AVX512VL AVX512F

Permute double-precision FP values from two tables in xmm3/m128/m64bcst and xmm1 using indexes in xmm2 and store the result in xmm1 using writemask k1.

EVEX.DDS.256.66.0F38.W1 7F /r VPERMT2PD ymm1 {k1}{z},

ymm2, ymm3/m256/m64bcst

B

V/V

AVX512VL AVX512F

Permute double-precision FP values from two tables in ymm3/m256/m64bcst and ymm1 using indexes in ymm2 and store the result in ymm1 using writemask k1.

EVEX.DDS.512.66.0F38.W1 7F /r VPERMT2PD zmm1 {k1}{z},

zmm2, zmm3/m512/m64bcst

B

V/V

AVX512F

Permute double-precision FP values from two tables in zmm3/m512/m64bcst and zmm1 using indices in zmm2 and store the result in zmm1 using writemask k1.



Instruction Operand Encoding

Op/En

Tuple Type

Operand 1

Operand 2

Operand 3

Operand 4

A

Full Mem

ModRM:reg (r,w)

EVEX.vvvv (r)

ModRM:r/m (r)

NA

B

Full

ModRM:reg (r, w)

EVEX.vvvv (r)

ModRM:r/m (r)

NA

Description

Permutes 16-bit/32-bit/64-bit values in the first operand and the third operand (the second source operand) using indices in the second operand (the first source operand) to select elements from the first and third operands. The selected elements are written to the destination operand (the first operand) according to the writemask k1.

The first and second operands are ZMM/YMM/XMM registers. The second operand contains input indices to select elements from the two input tables in the 1st and 3rd operands. The first operand is also the destination of the result.

D/Q/PS/PD element versions: The second source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 32/64-bit memory location. Broadcast from the low 32/64-bit memory location is performed if EVEX.b and the id bit for table selection are set (selecting table_2).

Dword/PS versions: The id bit for table selection is bit 4/3/2, depending on VL=512, 256, 128. Bits [3:0]/[2:0]/[1:0] of each element in the input index vector select an element within the two source operands, If the id bit is 0, table_1 (the first source) is selected; otherwise the second source operand is selected.

Qword/PD versions: The id bit for table selection is bit 3/2/1, and bits [2:0]/[1:0] /bit 0 selects element within each input table.

Word element versions: The second source operand can be a ZMM/YMM/XMM register, or a 512/256/128-bit memory location. The id bit for table selection is bit 5/4/3, and bits [4:0]/[3:0]/[2:0] selects element within each input table.

Note that these instructions permit a 16-bit/32-bit/64-bit value in the source operands to be copied to more than one location in the destination operand. Note also that in this case, the same index can be reused for example for a second iteration, while the table elements being permuted are overwritten.

Bits (MAXVL-1:256/128) of the destination are zeroed for VL=256,128.
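The index decoding described above can be illustrated with a small scalar model (a sketch of the architectural semantics, not the hardware implementation). It models the dword VL=512 case, where bits [3:0] of each index select an element and bit 4 (the id bit) selects between the two tables:

```c
#include <stdint.h>

/* Scalar sketch of VPERMT2D with VL=512 (16 dword elements), no masking.
 * dest doubles as table_1 and as the destination; src2 is table_2; idx
 * holds one 32-bit selector per lane. Bit 4 of each index picks the
 * table, and bits [3:0] pick the element within it. */
static void vpermt2d_model(uint32_t dest[16], const uint32_t idx[16],
                           const uint32_t src2[16])
{
    uint32_t tmp[16];                       /* snapshot of table_1 (old dest) */
    for (int i = 0; i < 16; i++)
        tmp[i] = dest[i];
    for (int i = 0; i < 16; i++) {
        uint32_t off = idx[i] & 0xF;        /* element selector, bits [3:0] */
        dest[i] = (idx[i] & 0x10)           /* id bit (bit 4) */
                      ? src2[off]           /* set: select from table_2 */
                      : tmp[off];           /* clear: select from table_1 */
    }
}
```

The snapshot of the old destination is what makes the "table elements being permuted are overwritten" behavior well defined in the model: every selection reads the pre-instruction value of the first operand.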


Operation

VPERMT2W (EVEX encoded versions)

(KL, VL) = (8, 128), (16, 256), (32, 512)
IF VL = 128
    id ← 2
FI;
IF VL = 256
    id ← 3
FI;
IF VL = 512
    id ← 4
FI;
TMP_DEST ← DEST
FOR j ← 0 TO KL-1
    i ← j * 16
    off ← 16*SRC1[i+id:i]
    IF k1[j] OR *no writemask*
        THEN DEST[i+15:i] ← SRC1[i+id+1] ? SRC2[off+15:off]
                                         : TMP_DEST[off+15:off]
        ELSE
            IF *merging-masking*    ; merging-masking
                THEN *DEST[i+15:i] remains unchanged*
                ELSE                ; zeroing-masking
                    DEST[i+15:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


VPERMT2D/VPERMT2PS (EVEX encoded versions)

(KL, VL) = (4, 128), (8, 256), (16, 512)
IF VL = 128
    id ← 1
FI;
IF VL = 256
    id ← 2
FI;
IF VL = 512
    id ← 3
FI;
TMP_DEST ← DEST
FOR j ← 0 TO KL-1
    i ← j * 32
    off ← 32*SRC1[i+id:i]
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC2 *is memory*)
                THEN DEST[i+31:i] ← SRC1[i+id+1] ? SRC2[31:0]
                                                 : TMP_DEST[off+31:off]
                ELSE DEST[i+31:i] ← SRC1[i+id+1] ? SRC2[off+31:off]
                                                 : TMP_DEST[off+31:off]
            FI
        ELSE
            IF *merging-masking*    ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE                ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


VPERMT2Q/VPERMT2PD (EVEX encoded versions)

(KL, VL) = (2, 128), (4, 256), (8, 512)
IF VL = 128
    id ← 0
FI;
IF VL = 256
    id ← 1
FI;
IF VL = 512
    id ← 2
FI;
TMP_DEST ← DEST
FOR j ← 0 TO KL-1
    i ← j * 64
    off ← 64*SRC1[i+id:i]
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC2 *is memory*)
                THEN DEST[i+63:i] ← SRC1[i+id+1] ? SRC2[63:0]
                                                 : TMP_DEST[off+63:off]
                ELSE DEST[i+63:i] ← SRC1[i+id+1] ? SRC2[off+63:off]
                                                 : TMP_DEST[off+63:off]
            FI
        ELSE
            IF *merging-masking*    ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE                ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


Intel C/C++ Compiler Intrinsic Equivalent

VPERMT2D __m512i _mm512_permutex2var_epi32(__m512i a, __m512i idx, __m512i b);
VPERMT2D __m512i _mm512_mask_permutex2var_epi32(__m512i a, __mmask16 k, __m512i idx, __m512i b);
VPERMT2D __m512i _mm512_mask2_permutex2var_epi32(__m512i a, __m512i idx, __mmask16 k, __m512i b);
VPERMT2D __m512i _mm512_maskz_permutex2var_epi32(__mmask16 k, __m512i a, __m512i idx, __m512i b);
VPERMT2D __m256i _mm256_permutex2var_epi32(__m256i a, __m256i idx, __m256i b);
VPERMT2D __m256i _mm256_mask_permutex2var_epi32(__m256i a, __mmask8 k, __m256i idx, __m256i b);
VPERMT2D __m256i _mm256_mask2_permutex2var_epi32(__m256i a, __m256i idx, __mmask8 k, __m256i b);
VPERMT2D __m256i _mm256_maskz_permutex2var_epi32(__mmask8 k, __m256i a, __m256i idx, __m256i b);
VPERMT2D __m128i _mm_permutex2var_epi32(__m128i a, __m128i idx, __m128i b);
VPERMT2D __m128i _mm_mask_permutex2var_epi32(__m128i a, __mmask8 k, __m128i idx, __m128i b);
VPERMT2D __m128i _mm_mask2_permutex2var_epi32(__m128i a, __m128i idx, __mmask8 k, __m128i b);
VPERMT2D __m128i _mm_maskz_permutex2var_epi32(__mmask8 k, __m128i a, __m128i idx, __m128i b);
VPERMT2PD __m512d _mm512_permutex2var_pd(__m512d a, __m512i idx, __m512d b);
VPERMT2PD __m512d _mm512_mask_permutex2var_pd(__m512d a, __mmask8 k, __m512i idx, __m512d b);
VPERMT2PD __m512d _mm512_mask2_permutex2var_pd(__m512d a, __m512i idx, __mmask8 k, __m512d b);
VPERMT2PD __m512d _mm512_maskz_permutex2var_pd(__mmask8 k, __m512d a, __m512i idx, __m512d b);
VPERMT2PD __m256d _mm256_permutex2var_pd(__m256d a, __m256i idx, __m256d b);
VPERMT2PD __m256d _mm256_mask_permutex2var_pd(__m256d a, __mmask8 k, __m256i idx, __m256d b);
VPERMT2PD __m256d _mm256_mask2_permutex2var_pd(__m256d a, __m256i idx, __mmask8 k, __m256d b);
VPERMT2PD __m256d _mm256_maskz_permutex2var_pd(__mmask8 k, __m256d a, __m256i idx, __m256d b);
VPERMT2PD __m128d _mm_permutex2var_pd(__m128d a, __m128i idx, __m128d b);
VPERMT2PD __m128d _mm_mask_permutex2var_pd(__m128d a, __mmask8 k, __m128i idx, __m128d b);
VPERMT2PD __m128d _mm_mask2_permutex2var_pd(__m128d a, __m128i idx, __mmask8 k, __m128d b);
VPERMT2PD __m128d _mm_maskz_permutex2var_pd(__mmask8 k, __m128d a, __m128i idx, __m128d b);
VPERMT2PS __m512 _mm512_permutex2var_ps(__m512 a, __m512i idx, __m512 b);
VPERMT2PS __m512 _mm512_mask_permutex2var_ps(__m512 a, __mmask16 k, __m512i idx, __m512 b);
VPERMT2PS __m512 _mm512_mask2_permutex2var_ps(__m512 a, __m512i idx, __mmask16 k, __m512 b);
VPERMT2PS __m512 _mm512_maskz_permutex2var_ps(__mmask16 k, __m512 a, __m512i idx, __m512 b);
VPERMT2PS __m256 _mm256_permutex2var_ps(__m256 a, __m256i idx, __m256 b);
VPERMT2PS __m256 _mm256_mask_permutex2var_ps(__m256 a, __mmask8 k, __m256i idx, __m256 b);
VPERMT2PS __m256 _mm256_mask2_permutex2var_ps(__m256 a, __m256i idx, __mmask8 k, __m256 b);
VPERMT2PS __m256 _mm256_maskz_permutex2var_ps(__mmask8 k, __m256 a, __m256i idx, __m256 b);
VPERMT2PS __m128 _mm_permutex2var_ps(__m128 a, __m128i idx, __m128 b);
VPERMT2PS __m128 _mm_mask_permutex2var_ps(__m128 a, __mmask8 k, __m128i idx, __m128 b);
VPERMT2PS __m128 _mm_mask2_permutex2var_ps(__m128 a, __m128i idx, __mmask8 k, __m128 b);
VPERMT2PS __m128 _mm_maskz_permutex2var_ps(__mmask8 k, __m128 a, __m128i idx, __m128 b);
VPERMT2Q __m512i _mm512_permutex2var_epi64(__m512i a, __m512i idx, __m512i b);
VPERMT2Q __m512i _mm512_mask_permutex2var_epi64(__m512i a, __mmask8 k, __m512i idx, __m512i b);
VPERMT2Q __m512i _mm512_mask2_permutex2var_epi64(__m512i a, __m512i idx, __mmask8 k, __m512i b);
VPERMT2Q __m512i _mm512_maskz_permutex2var_epi64(__mmask8 k, __m512i a, __m512i idx, __m512i b);
VPERMT2Q __m256i _mm256_permutex2var_epi64(__m256i a, __m256i idx, __m256i b);
VPERMT2Q __m256i _mm256_mask_permutex2var_epi64(__m256i a, __mmask8 k, __m256i idx, __m256i b);
VPERMT2Q __m256i _mm256_mask2_permutex2var_epi64(__m256i a, __m256i idx, __mmask8 k, __m256i b);
VPERMT2Q __m256i _mm256_maskz_permutex2var_epi64(__mmask8 k, __m256i a, __m256i idx, __m256i b);
VPERMT2Q __m128i _mm_permutex2var_epi64(__m128i a, __m128i idx, __m128i b);
VPERMT2Q __m128i _mm_mask_permutex2var_epi64(__m128i a, __mmask8 k, __m128i idx, __m128i b);
VPERMT2Q __m128i _mm_mask2_permutex2var_epi64(__m128i a, __m128i idx, __mmask8 k, __m128i b);
VPERMT2Q __m128i _mm_maskz_permutex2var_epi64(__mmask8 k, __m128i a, __m128i idx, __m128i b);
VPERMT2W __m512i _mm512_permutex2var_epi16(__m512i a, __m512i idx, __m512i b);
VPERMT2W __m512i _mm512_mask_permutex2var_epi16(__m512i a, __mmask32 k, __m512i idx, __m512i b);
VPERMT2W __m512i _mm512_mask2_permutex2var_epi16(__m512i a, __m512i idx, __mmask32 k, __m512i b);
VPERMT2W __m512i _mm512_maskz_permutex2var_epi16(__mmask32 k, __m512i a, __m512i idx, __m512i b);
VPERMT2W __m256i _mm256_permutex2var_epi16(__m256i a, __m256i idx, __m256i b);
VPERMT2W __m256i _mm256_mask_permutex2var_epi16(__m256i a, __mmask16 k, __m256i idx, __m256i b);
VPERMT2W __m256i _mm256_mask2_permutex2var_epi16(__m256i a, __m256i idx, __mmask16 k, __m256i b);
VPERMT2W __m256i _mm256_maskz_permutex2var_epi16(__mmask16 k, __m256i a, __m256i idx, __m256i b);
VPERMT2W __m128i _mm_permutex2var_epi16(__m128i a, __m128i idx, __m128i b);
VPERMT2W __m128i _mm_mask_permutex2var_epi16(__m128i a, __mmask8 k, __m128i idx, __m128i b);
VPERMT2W __m128i _mm_mask2_permutex2var_epi16(__m128i a, __m128i idx, __mmask8 k, __m128i b);
VPERMT2W __m128i _mm_maskz_permutex2var_epi16(__mmask8 k, __m128i a, __m128i idx, __m128i b);

SIMD Floating-Point Exceptions

None.


Other Exceptions

VPERMT2D/Q/PS/PD: See Exceptions Type E4NF.

VPERMT2W: See Exceptions Type E4NF.nb.


VPEXPANDD—Load Sparse Packed Doubleword Integer Values from Dense Memory / Register

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description

EVEX.128.66.0F38.W0 89 /r VPEXPANDD xmm1 {k1}{z}, xmm2/m128 | A | V/V | AVX512VL AVX512F | Expand packed double-word integer values from xmm2/m128 to xmm1 using writemask k1.

EVEX.256.66.0F38.W0 89 /r VPEXPANDD ymm1 {k1}{z}, ymm2/m256 | A | V/V | AVX512VL AVX512F | Expand packed double-word integer values from ymm2/m256 to ymm1 using writemask k1.

EVEX.512.66.0F38.W0 89 /r VPEXPANDD zmm1 {k1}{z}, zmm2/m512 | A | V/V | AVX512F | Expand packed double-word integer values from zmm2/m512 to zmm1 using writemask k1.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4

A | Tuple1 Scalar | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Description

Expand (load) up to 16 contiguous doubleword integer values of the input vector in the source operand (the second operand) to sparse elements in the destination operand (the first operand), selected by the writemask k1. The destination operand is a ZMM register; the source operand can be a ZMM register or a memory location.

The input vector starts from the lowest element in the source operand. The opmask register k1 selects the destination elements (a partial vector or sparse elements if less than 8 elements) to be replaced by the ascending elements in the input vector. Destination elements not selected by the writemask k1 are either unmodified or zeroed, depending on EVEX.z.

Note: EVEX.vvvv is reserved and must be 1111b; otherwise, the instruction will #UD.

Note that the compressed displacement assumes a pre-scaling (N) corresponding to the size of a single element instead of the size of the full vector.
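The expand semantics can be sketched as a scalar model (illustrative only, with zeroing-masking assumed). Consecutive source elements fill exactly those destination lanes whose mask bit is set:

```c
#include <stdint.h>

/* Scalar sketch of VPEXPANDD with VL=512 (KL=16) and EVEX.z = 1:
 * contiguous source elements are expanded into the destination lanes
 * selected by k1; unselected lanes are zeroed. */
static void vpexpandd_model(uint32_t dest[16], uint16_t k1,
                            const uint32_t src[16])
{
    int k = 0;                       /* index of next contiguous source element */
    for (int j = 0; j < 16; j++) {
        if (k1 & (1u << j))
            dest[j] = src[k++];      /* selected lane: take next input element */
        else
            dest[j] = 0;             /* zeroing-masking */
    }
}
```

With merging-masking (EVEX.z = 0), the `else` branch would leave `dest[j]` unchanged instead of zeroing it.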


Operation

VPEXPANDD (EVEX encoded versions)

(KL, VL) = (4, 128), (8, 256), (16, 512)
k ← 0
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            DEST[i+31:i] ← SRC[k+31:k];
            k ← k + 32
        ELSE
            IF *merging-masking*    ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE                ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0



Intel C/C++ Compiler Intrinsic Equivalent

VPEXPANDD __m512i _mm512_mask_expandloadu_epi32(__m512i s, __mmask16 k, void * a);
VPEXPANDD __m512i _mm512_maskz_expandloadu_epi32(__mmask16 k, void * a);
VPEXPANDD __m512i _mm512_mask_expand_epi32(__m512i s, __mmask16 k, __m512i a);
VPEXPANDD __m512i _mm512_maskz_expand_epi32(__mmask16 k, __m512i a);
VPEXPANDD __m256i _mm256_mask_expandloadu_epi32(__m256i s, __mmask8 k, void * a);
VPEXPANDD __m256i _mm256_maskz_expandloadu_epi32(__mmask8 k, void * a);
VPEXPANDD __m256i _mm256_mask_expand_epi32(__m256i s, __mmask8 k, __m256i a);
VPEXPANDD __m256i _mm256_maskz_expand_epi32(__mmask8 k, __m256i a);
VPEXPANDD __m128i _mm_mask_expandloadu_epi32(__m128i s, __mmask8 k, void * a);
VPEXPANDD __m128i _mm_maskz_expandloadu_epi32(__mmask8 k, void * a);
VPEXPANDD __m128i _mm_mask_expand_epi32(__m128i s, __mmask8 k, __m128i a);
VPEXPANDD __m128i _mm_maskz_expand_epi32(__mmask8 k, __m128i a);


SIMD Floating-Point Exceptions

None


Other Exceptions

EVEX-encoded instruction, see Exceptions Type E4.nb.

#UD If EVEX.vvvv != 1111B.


VPEXPANDQ—Load Sparse Packed Quadword Integer Values from Dense Memory / Register

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description

EVEX.128.66.0F38.W1 89 /r VPEXPANDQ xmm1 {k1}{z}, xmm2/m128 | A | V/V | AVX512VL AVX512F | Expand packed quad-word integer values from xmm2/m128 to xmm1 using writemask k1.

EVEX.256.66.0F38.W1 89 /r VPEXPANDQ ymm1 {k1}{z}, ymm2/m256 | A | V/V | AVX512VL AVX512F | Expand packed quad-word integer values from ymm2/m256 to ymm1 using writemask k1.

EVEX.512.66.0F38.W1 89 /r VPEXPANDQ zmm1 {k1}{z}, zmm2/m512 | A | V/V | AVX512F | Expand packed quad-word integer values from zmm2/m512 to zmm1 using writemask k1.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4

A | Tuple1 Scalar | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Description

Expand (load) up to 8 quadword integer values from the source operand (the second operand) to sparse elements in the destination operand (the first operand), selected by the writemask k1. The destination operand is a ZMM register; the source operand can be a ZMM register or a memory location.

The input vector starts from the lowest element in the source operand. The opmask register k1 selects the destination elements (a partial vector or sparse elements if less than 8 elements) to be replaced by the ascending elements in the input vector. Destination elements not selected by the writemask k1 are either unmodified or zeroed, depending on EVEX.z.

Note: EVEX.vvvv is reserved and must be 1111b; otherwise, the instruction will #UD.

Note that the compressed displacement assumes a pre-scaling (N) corresponding to the size of a single element instead of the size of the full vector.


Operation

VPEXPANDQ (EVEX encoded versions)

(KL, VL) = (2, 128), (4, 256), (8, 512)
k ← 0
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN
            DEST[i+63:i] ← SRC[k+63:k];
            k ← k + 64
        ELSE
            IF *merging-masking*    ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE                ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0



Intel C/C++ Compiler Intrinsic Equivalent

VPEXPANDQ __m512i _mm512_mask_expandloadu_epi64(__m512i s, __mmask8 k, void * a);
VPEXPANDQ __m512i _mm512_maskz_expandloadu_epi64(__mmask8 k, void * a);
VPEXPANDQ __m512i _mm512_mask_expand_epi64(__m512i s, __mmask8 k, __m512i a);
VPEXPANDQ __m512i _mm512_maskz_expand_epi64(__mmask8 k, __m512i a);
VPEXPANDQ __m256i _mm256_mask_expandloadu_epi64(__m256i s, __mmask8 k, void * a);
VPEXPANDQ __m256i _mm256_maskz_expandloadu_epi64(__mmask8 k, void * a);
VPEXPANDQ __m256i _mm256_mask_expand_epi64(__m256i s, __mmask8 k, __m256i a);
VPEXPANDQ __m256i _mm256_maskz_expand_epi64(__mmask8 k, __m256i a);
VPEXPANDQ __m128i _mm_mask_expandloadu_epi64(__m128i s, __mmask8 k, void * a);
VPEXPANDQ __m128i _mm_maskz_expandloadu_epi64(__mmask8 k, void * a);
VPEXPANDQ __m128i _mm_mask_expand_epi64(__m128i s, __mmask8 k, __m128i a);
VPEXPANDQ __m128i _mm_maskz_expand_epi64(__mmask8 k, __m128i a);


SIMD Floating-Point Exceptions

None


Other Exceptions

EVEX-encoded instruction, see Exceptions Type E4.nb.

#UD If EVEX.vvvv != 1111B.


VPGATHERDD/VPGATHERQD — Gather Packed Dword Values Using Signed Dword/Qword Indices

Opcode/Instruction | Op/En | 64/32-bit Mode | CPUID Feature Flag | Description

VEX.DDS.128.66.0F38.W0 90 /r VPGATHERDD xmm1, vm32x, xmm2 | RMV | V/V | AVX2 | Using dword indices specified in vm32x, gather dword values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1.

VEX.DDS.128.66.0F38.W0 91 /r VPGATHERQD xmm1, vm64x, xmm2 | RMV | V/V | AVX2 | Using qword indices specified in vm64x, gather dword values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1.

VEX.DDS.256.66.0F38.W0 90 /r VPGATHERDD ymm1, vm32y, ymm2 | RMV | V/V | AVX2 | Using dword indices specified in vm32y, gather dword values from memory conditioned on mask specified by ymm2. Conditionally gathered elements are merged into ymm1.

VEX.DDS.256.66.0F38.W0 91 /r VPGATHERQD xmm1, vm64y, xmm2 | RMV | V/V | AVX2 | Using qword indices specified in vm64y, gather dword values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1.


Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4

RMV | ModRM:reg (r,w) | BaseReg (R): VSIB:base, VectorReg (R): VSIB:index | VEX.vvvv (r, w) | NA


Description

The instruction conditionally loads up to 4 or 8 dword values from memory addresses specified by the memory operand (the second operand) and using dword indices. The memory operand uses the VSIB form of the SIB byte to specify a general-purpose register operand as the common base, a vector register for an array of indices relative to the base, and a constant scale factor.

The mask operand (the third operand) specifies the conditional load operation from each memory address and the corresponding update of each data element of the destination operand (the first operand). Conditionality is specified by the most significant bit of each data element of the mask register. If an element’s mask bit is not set, the corresponding element of the destination register is left unchanged. The width of data element in the destination register and mask register are identical. The entire mask register will be set to zero by this instruction unless the instruction causes an exception.

Using qword indices, the instruction conditionally loads up to 2 or 4 qword values from the VSIB addressing memory operand, and updates the lower half of the destination register. The upper 128 or 256 bits of the destination register are zeroed with qword indices.

This instruction can be suspended by an exception if at least one element is already gathered (i.e., if the exception is triggered by an element other than the rightmost one with its mask bit set). When this happens, the destination register and the mask operand are partially updated; those elements that have been gathered are placed into the destination register and have their mask bits set to zero. If any traps or interrupts are pending from already gathered elements, they will be delivered in lieu of the exception; in this case, EFLAGS.RF is set to one so an instruction breakpoint is not re-triggered when the instruction is continued.

If the data size and index size are different, part of the destination register and part of the mask register do not correspond to any elements being gathered. This instruction sets those parts to zero. It may do this to one or both of those registers even if the instruction triggers an exception, and even if the instruction triggers the exception before gathering any elements.

VEX.128 version: For dword indices, the instruction will gather four dword values. For qword indices, the instruction will gather two values and zero the upper 64 bits of the destination.



VEX.256 version: For dword indices, the instruction will gather eight dword values. For qword indices, the instruction will gather four values and zero the upper 128 bits of the destination.
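The masked-merge behavior described above can be sketched as a scalar model (illustrative only; it indexes an in-memory dword table directly rather than computing base + index*scale + displacement as the real VSIB addressing does):

```c
#include <stdint.h>

/* Scalar sketch of VPGATHERDD xmm1, vm32x, xmm2 (4 dword lanes): for each
 * lane whose mask element has its most significant bit set, load the dword
 * selected by the corresponding index and merge it into dest; lanes with a
 * clear mask bit are left unchanged. The whole mask is zeroed afterwards,
 * as the instruction does on normal (exception-free) completion. */
static void vpgatherdd_model(uint32_t dest[4], uint32_t mask[4],
                             const int32_t idx[4], const uint32_t *base)
{
    for (int i = 0; i < 4; i++) {
        if (mask[i] & 0x80000000u)   /* conditionality: MSB of mask element */
            dest[i] = base[idx[i]];  /* gather and merge into destination */
        mask[i] = 0;                 /* mask register cleared on completion */
    }
}
```

The per-element mask clearing is what makes the instruction restartable: after a fault, already-gathered lanes have their mask bits cleared, so re-executing the instruction gathers only the remaining lanes.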

Note that:

Use of a destination operand not aligned to 64-byte boundary (in either 64-bit or 32-bit modes) results in a general-protection (#GP) exception. In 64-bit mode, the upper 32 bits of RDX and RAX are ignored.

See Section 13.6, “Processor Tracking of XSAVE-Managed State,” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1 for discussion of the bitmap XMODIFIED and of the quantity XRSTOR_INFO.



  1. There is an exception for state component 1 (SSE). MXCSR is part of SSE state, but XINUSE[1] may be 0 even if MXCSR does not have its initial value of 1F80H. In this case, the init optimization does not apply and XSAVEC will save SSE state as long as RFBM[1] = 1 and the modified optimization is not being applied.

  2. There is an exception for state component 1 (SSE). MXCSR is part of SSE state, but XINUSE[1] may be 0 even if MXCSR does not have its initial value of 1F80H. In this case, XSAVES sets XSTATE_BV[1] to 1 as long as RFBM[1] = 1.



Operation

RFBM ← (XCR0 OR IA32_XSS) AND EDX:EAX;    /* bitwise logical OR and AND */
IF in VMX non-root operation
    THEN VMXNR ← 1;
    ELSE VMXNR ← 0;
FI;
LAXA ← linear address of XSAVE area;
COMPMASK ← RFBM OR 80000000_00000000H;
TO_BE_SAVED ← RFBM AND XINUSE;
IF XRSTOR_INFO = (CPL, VMXNR, LAXA, COMPMASK)
    THEN TO_BE_SAVED ← TO_BE_SAVED AND XMODIFIED;
FI;
IF MXCSR ≠ 1F80H AND RFBM[1]
    THEN TO_BE_SAVED[1] ← 1;
FI;

IF TO_BE_SAVED[0] = 1
    THEN store x87 state into legacy region of XSAVE area;
FI;

IF TO_BE_SAVED[1] = 1
    THEN store SSE state into legacy region of XSAVE area;    // this step saves the XMM registers, MXCSR, and MXCSR_MASK
FI;

NEXT_FEATURE_OFFSET ← 576;    // legacy area and XSAVE header consume 576 bytes
FOR i ← 2 TO 62
    IF RFBM[i] = 1
        THEN
            IF TO_BE_SAVED[i]
                THEN
                    save XSAVE state component i at offset NEXT_FEATURE_OFFSET from base of XSAVE area;
                    IF i = 8    // state component 8 is for PT state
                        THEN IA32_RTIT_CTL.TraceEn[bit 0] ← 0;
                    FI;
            FI;
            NEXT_FEATURE_OFFSET ← NEXT_FEATURE_OFFSET + n (n enumerated by CPUID(EAX=0DH,ECX=i):EAX);
    FI;
ENDFOR;

XSTATE_BV field in XSAVE header ← TO_BE_SAVED;
XCOMP_BV field in XSAVE header ← COMPMASK;
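The mask setup at the top of this operation is plain bit arithmetic and can be sketched in C (a model of the mask computation only; the register values passed in are illustrative, not read from hardware):

```c
#include <stdint.h>

/* RFBM is the requested-feature bitmap: the intersection of the features
 * enabled in XCR0 or IA32_XSS with those requested in EDX:EAX. */
static uint64_t xsaves_rfbm(uint64_t xcr0, uint64_t ia32_xss,
                            uint64_t edx_eax)
{
    return (xcr0 | ia32_xss) & edx_eax;
}

/* COMPMASK is written to XCOMP_BV; bit 63 set marks the compacted format. */
static uint64_t xsaves_compmask(uint64_t rfbm)
{
    return rfbm | 0x8000000000000000ull;
}
```

For example, with XCR0 = 7 (x87, SSE, AVX), IA32_XSS = 100H, and EDX:EAX = FFH, RFBM is 7 because the requested mask excludes component 8.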


Flags Affected

None.


Intel C/C++ Compiler Intrinsic Equivalent

XSAVES: void _xsaves( void *, unsigned __int64);
XSAVES64: void _xsaves64( void *, unsigned __int64);



Protected Mode Exceptions

#GP(0) If CPL > 0.

If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. If a memory operand is not aligned on a 64-byte boundary, regardless of segment.

#SS(0) If a memory operand effective address is outside the SS segment limit.

#PF(fault-code) If a page fault occurs.

#NM If CR0.TS[bit 3] = 1.

#UD If CPUID.01H:ECX.XSAVE[bit 26] = 0 or CPUID.(EAX=0DH,ECX=1):EAX.XSS[bit 3] = 0.

If CR4.OSXSAVE[bit 18] = 0.

If the LOCK prefix is used.

#AC If this exception is disabled, a general protection exception (#GP) is signaled if the memory operand is not aligned on a 64-byte boundary, as described above. If the alignment check exception (#AC) is enabled (and the CPL is 3), signaling of #AC is not guaranteed and may vary with implementation, as follows. In all implementations where #AC is not signaled, a general protection exception is signaled in its place. In addition, the width of the alignment check may also vary with implementation. For instance, for a given implementation, an alignment check exception might be signaled for a 2-byte misalignment, whereas a general protection exception might be signaled for all other misalignments (4-, 8-, or 16-byte misalignments).


Real-Address Mode Exceptions

#GP If a memory operand is not aligned on a 64-byte boundary, regardless of segment. If any part of the operand lies outside the effective address space from 0 to FFFFH.

#NM If CR0.TS[bit 3] = 1.

#UD If CPUID.01H:ECX.XSAVE[bit 26] = 0 or CPUID.(EAX=0DH,ECX=1):EAX.XSS[bit 3] = 0.

If CR4.OSXSAVE[bit 18] = 0.

If the LOCK prefix is used.


Virtual-8086 Mode Exceptions

Same exceptions as in protected mode.


Compatibility Mode Exceptions

Same exceptions as in protected mode.



64-Bit Mode Exceptions

#GP(0) If CPL > 0.

If the memory address is in a non-canonical form.

If a memory operand is not aligned on a 64-byte boundary, regardless of segment.

#SS(0) If a memory address referencing the SS segment is in a non-canonical form.

#PF(fault-code) If a page fault occurs.

#NM If CR0.TS[bit 3] = 1.

#UD If CPUID.01H:ECX.XSAVE[bit 26] = 0 or CPUID.(EAX=0DH,ECX=1):EAX.XSS[bit 3] = 0.

If CR4.OSXSAVE[bit 18] = 0.

If the LOCK prefix is used.

#AC If this exception is disabled, a general protection exception (#GP) is signaled if the memory operand is not aligned on a 64-byte boundary, as described above. If the alignment check exception (#AC) is enabled (and the CPL is 3), signaling of #AC is not guaranteed and may vary with implementation, as follows. In all implementations where #AC is not signaled, a general protection exception is signaled in its place. In addition, the width of the alignment check may also vary with implementation. For instance, for a given implementation, an alignment check exception might be signaled for a 2-byte misalignment, whereas a general protection exception might be signaled for all other misalignments (4-, 8-, or 16-byte misalignments).


XSETBV—Set Extended Control Register

Opcode | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description

NP 0F 01 D1 | XSETBV | ZO | Valid | Valid | Write the value in EDX:EAX to the XCR specified by ECX.


Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4

ZO | NA | NA | NA | NA


Description

Writes the contents of registers EDX:EAX into the 64-bit extended control register (XCR) specified in the ECX register. (On processors that support the Intel 64 architecture, the high-order 32 bits of RCX are ignored.) The contents of the EDX register are copied to the high-order 32 bits of the selected XCR and the contents of the EAX register are copied to the low-order 32 bits of the XCR. (On processors that support the Intel 64 architecture, the high-order 32 bits of each of RAX and RDX are ignored.) Undefined or reserved bits in an XCR should be set to values previously read.

This instruction must be executed at privilege level 0 or in real-address mode; otherwise, a general protection exception #GP(0) is generated. Specifying a reserved or unimplemented XCR in ECX will also cause a general protection exception. The processor will also generate a general protection exception if software attempts to write to reserved bits in an XCR.

Currently, only XCR0 is supported. Thus, all other values of ECX are reserved and will cause a #GP(0). Note that bit 0 of XCR0 (corresponding to x87 state) must be set to 1; the instruction will cause a #GP(0) if an attempt is made to clear this bit. In addition, the instruction causes a #GP(0) if an attempt is made to set XCR0[2] (AVX state) while clearing XCR0[1] (SSE state); it is necessary to set both bits to use AVX instructions; see Section 13.3, “Enabling the XSAVE Feature Set and XSAVE-Enabled Features,” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1.
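The two XCR0 value checks called out above can be sketched as a small validation helper (illustrative only; it models just the bit-0 and XCR0[2:1] rules described here, not the full set of architectural reserved-bit checks):

```c
#include <stdint.h>

/* Returns 1 if the candidate XCR0 value passes the two checks described
 * above, 0 if writing it would raise #GP(0). */
static int xcr0_value_ok(uint64_t v)
{
    if ((v & 0x1) == 0)
        return 0;    /* bit 0 (x87) must be 1 */
    if ((v & 0x6) == 0x4)
        return 0;    /* XCR0[2:1] = 10b: AVX set while SSE clear */
    return 1;
}
```

For example, 7H (x87+SSE+AVX) and 3H (x87+SSE) are acceptable, while 5H (x87+AVX without SSE) is rejected.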


Operation

XCR[ECX] ← EDX:EAX;


Flags Affected

None.


Intel C/C++ Compiler Intrinsic Equivalent

XSETBV: void _xsetbv( unsigned int, unsigned __int64);


Protected Mode Exceptions

#GP(0) If the current privilege level is not 0. If an invalid XCR is specified in ECX.

If the value in EDX:EAX sets bits that are reserved in the XCR specified by ECX. If an attempt is made to clear bit 0 of XCR0.

If an attempt is made to set XCR0[2:1] to 10b.

#UD If CPUID.01H:ECX.XSAVE[bit 26] = 0.

If CR4.OSXSAVE[bit 18] = 0.

If the LOCK prefix is used.



Real-Address Mode Exceptions

#GP If an invalid XCR is specified in ECX.

If the value in EDX:EAX sets bits that are reserved in the XCR specified by ECX. If an attempt is made to clear bit 0 of XCR0.

If an attempt is made to set XCR0[2:1] to 10b.

#UD If CPUID.01H:ECX.XSAVE[bit 26] = 0.

If CR4.OSXSAVE[bit 18] = 0.

If the LOCK prefix is used.


Virtual-8086 Mode Exceptions

#GP(0) The XSETBV instruction is not recognized in virtual-8086 mode.


Compatibility Mode Exceptions

Same exceptions as in protected mode.


64-Bit Mode Exceptions

Same exceptions as in protected mode.


XTEST — Test If In Transactional Execution

Opcode/Instruction | Op/En | 64/32bit Mode Support | CPUID Feature Flag | Description

NP 0F 01 D6 XTEST | A | V/V | HLE or RTM | Test if executing in a transactional region.


Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4

A | NA | NA | NA | NA


Description

The XTEST instruction queries the transactional execution status. If the instruction executes inside a transactionally executing RTM region or a transactionally executing HLE region, then the ZF flag is cleared, else it is set.


Operation

XTEST

IF (RTM_ACTIVE = 1 OR HLE_ACTIVE = 1)
    THEN ZF ← 0
    ELSE ZF ← 1
FI;
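The flag computation is a one-liner and can be modeled directly (software would normally use the `_xtest()` intrinsic rather than inspecting ZF; this helper only mirrors the pseudocode above):

```c
/* Model of XTEST's effect on ZF: 0 inside an active RTM or HLE region,
 * 1 otherwise. */
static int xtest_zf(int rtm_active, int hle_active)
{
    return (rtm_active || hle_active) ? 0 : 1;
}
```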


Flags Affected

The ZF flag is cleared if the instruction is executed transactionally; otherwise it is set to 1. The CF, OF, SF, PF, and AF flags are cleared.


Intel C/C++ Compiler Intrinsic Equivalent

XTEST: int _xtest( void );


SIMD Floating-Point Exceptions

None


Other Exceptions

#UD If CPUID.(EAX=7, ECX=0):EBX.HLE[bit 4] = 0 and CPUID.(EAX=7, ECX=0):EBX.RTM[bit 11] = 0.

If LOCK prefix is used.